Books:
- (Garnett 2023) Bayesian Optimization
- (Nguyen 2024) Bayesian Optimization in Action
Libraries:
- GPyTorch - A Gaussian Process library implemented in PyTorch.
- BoTorch - A library for Bayesian optimization built on top of GPyTorch and PyTorch.
- scikit-optimize - A simple and efficient library for Bayesian optimization in Python.
Overview
- In this course we will build the following skills:
- Gaussian Processes
- Modeling with Gaussian Processes
- Model Assessment, Selection, and Averaging
- Decision Theory for Optimization
- Utility Functions for Optimization
- Common Bayesian Optimization Policies
- Computing Policies with Gaussian Processes
- Implementation
- Theoretical Analysis
- Extensions and Related Settings
- A Brief History of Bayesian Optimization
- There are five modules in this course:
- Bayesian conjugate analysis for autoregressive time series models: We will focus on AR(p) models that fit in the Normal conjugate family, meaning the prior and likelihood lead to a posterior that is also in the Normal family.
- Model selection criteria: We will review the theory and code for AIC, BIC, and DIC, used to select the order of the AR(p) model as well as the number of components in the mixture model we develop in the next module.
- Bayesian location mixture of AR(p) models: This module covers a new kind of model, in which we extend the AR(p) model from a single location to a location mixture. This leads to conditionally conjugate components, keeping everything more or less Normal while gaining the many benefits of a mixture model.
- The peer-reviewed data analysis project: We will develop the model and then evaluate other students’ models.
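The order-selection idea from the module list above can be sketched numerically. Below is a minimal numpy illustration (my own sketch, not the course’s code): it fits AR(p) models by conditional least squares, computes the maximized Gaussian log-likelihood, and compares AIC and BIC across candidate orders. The helper names and simulated series are assumptions for the example.

```python
import numpy as np

def fit_ar(y, p):
    """Fit an AR(p) model (with intercept) by conditional least squares.
    Returns the residual variance, parameter count, and effective sample size."""
    n = len(y)
    # Design matrix: row t holds (y[t-1], ..., y[t-p], 1) for targets y[p:]
    X = np.column_stack([y[p - j - 1:n - j - 1] for j in range(p)] + [np.ones(n - p)])
    z = y[p:]
    beta, *_ = np.linalg.lstsq(X, z, rcond=None)
    resid = z - X @ beta
    sigma2 = resid @ resid / len(z)
    k = p + 2  # AR coefficients + intercept + innovation variance
    return sigma2, k, len(z)

def aic_bic(y, p):
    sigma2, k, n = fit_ar(y, p)
    # Maximized Gaussian log-likelihood at the MLE of the variance
    loglik = -0.5 * n * (np.log(2 * np.pi * sigma2) + 1)
    return -2 * loglik + 2 * k, -2 * loglik + k * np.log(n)

# Simulate an AR(2) process and score orders 1..5
rng = np.random.default_rng(0)
y = np.zeros(500)
for t in range(2, 500):
    y[t] = 0.6 * y[t - 1] - 0.3 * y[t - 2] + rng.normal()

scores = {p: aic_bic(y, p) for p in range(1, 6)}
best_aic = min(scores, key=lambda p: scores[p][0])
best_bic = min(scores, key=lambda p: scores[p][1])
```

Both criteria trade fit against complexity; BIC penalizes extra lags more heavily for large samples, so it tends to pick the more parsimonious order.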
I also explored his personal website and found that it contains some very interesting material.
He has written a book called Cracking The First Year Exam, a collection of questions from five first-year statistics courses. I noticed that some of the instructors from the specialization feature in this book.
He has made extensive notes for a course on Stochastic Processes. This is one of the first-year MA courses I haven’t had the opportunity to take, and I will go over these notes as soon as I complete this course. I wanted to explore the connections between stochastic processes and Bayesian statistics, as I believe they complement each other well. I also found that he has taught a short course on Bayesian Causal Inference which is based primarily on research papers.
In Classical Inference he covers material from a frequentist statistical inference course that follows Casella and Berger (2002). This material is covered to some extent in the first course of the specialization, but not in any great detail, so it may be useful to review it as well. In fact, some of it fills gaps I felt were missing and added as appendices, like a proof of the central limit theorem and a number of results related to the law of large numbers. I wish I could say this is easy reading after taking the specialization, but it should at least look fairly familiar. Another topic I find of interest is chapter 9, on the method of moments.
In Bayesian Causal Inference - slides and Bayesian Causal Inference - report, he reviews material from papers on Bayesian causal inference.
Overall I think he is one of the best explainers of this complex material. One downside is that his English is a little broken, but I’ve had much worse. You can see from the first slide that he covers the key concepts in a clear and concise manner, leaving out very little.
It also seems to me that Jizhou Kang still remembers how hard this material is at first glance, which is a great asset for teaching it effectively. You may notice that we revisit some old material; this is because the course makes a great effort to be self-contained.
Many parts of this course are steps towards completing the capstone project. And perhaps just as exciting, we will also cover a new type of model: a mixture version of the autoregressive model from the previous course.
We will cover Bayesian conjugate analysis for autoregressive time series models.
This sounds a bit bombastic; all it means is that we will use a likelihood and a prior that lead to a conjugate posterior, keeping everything within one distributional family!
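As a rough illustration of that conjugacy (my own numpy sketch, not course material): if we assume the innovation variance is known, conditioning on the first observation turns an AR(1) likelihood into a Normal linear model, so a Normal prior on the coefficient yields a Normal posterior in closed form.

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulate an AR(1) series: y_t = 0.7 * y_{t-1} + eps_t, eps_t ~ N(0, 1)
phi_true, sigma2 = 0.7, 1.0
y = np.zeros(300)
for t in range(1, 300):
    y[t] = phi_true * y[t - 1] + rng.normal(scale=np.sqrt(sigma2))

# Conditional on y_0, the AR(1) likelihood is a linear model z = X * phi + eps
X = y[:-1, None]  # lagged values as the design matrix
z = y[1:]

# Normal prior phi ~ N(m0, C0); with a Normal likelihood (known variance),
# the posterior is Normal with the standard linear-model update
m0 = np.array([0.0])
C0 = np.array([[1.0]])

Cn_inv = np.linalg.inv(C0) + X.T @ X / sigma2   # posterior precision
Cn = np.linalg.inv(Cn_inv)                      # posterior covariance
mn = Cn @ (np.linalg.inv(C0) @ m0 + X.T @ z / sigma2)  # posterior mean
```

With 300 observations the posterior mean lands close to the true coefficient, and the same update extends directly to AR(p) by stacking p lagged columns in `X`. The course treats the variance as unknown too, which the Normal-Inverse-Gamma family handles while staying conjugate.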