I decided to migrate some material that is auxiliary to the course:
- An overview of the course.
- A review of some mathematical and statistical results used in the course.
- A bibliography of books I found useful in the course.
- A Feynman notebook for the course, which is now in a separate notebook.
Course Card
- Course: Bayesian Statistics: Time Series
- Offered by: University of California, Santa Cruz
- Instructor: Raquel Prado
- Certificate: Yes
- Level: Graduate
- Commitment: 4 weeks of study, 3-4 hours/week
Overview of the course
- This course seems very similar to a classic introductory time series course (AR, MA, ARMA, ARIMA, SARIMA, DLM, etc.) with a Bayesian treatment added.
- However, apart from the AR(p) model and the NDLM, which generalizes AR(p), we do not cover the other models in any depth, even though the instructor refers to them in many places. So while the course is self-contained, it is not sufficient to cover the basics of time series analysis, and you will need to learn about these models elsewhere.
- This is unfortunate, as the NDLM seems to be a very general framework able to cover a wide range of time series models as special cases. However, I cannot say much more, as having taken just this course I still have much to learn about these other models. That said, there are many good resources on the web on time series analysis, and I will list some of them in the bibliography below.
- Another point that vexed me is that the course makes use of some mathematics that is not covered in a typical undergraduate mathematics program. The instructor does not explain these results, which might be because they are out of scope, but she also does not state this up front, does not mention that they are prerequisites, and does not provide a reference or a handout as was done in the previous courses in the specialization. Some of these results are also notably absent from the course textbook Prado, Ferreira, and West (2023) and the other reference, West and Harrison (2013).
- Code for some of the examples, such as the two case studies, is missing.
- The NDLM is closely related to the Kalman filter, but unfortunately this connection is not discussed in the course. Giving some examples from the Kalman filtering literature would have provided very familiar applications of the NDLM, as well as motivation and context that are otherwise lacking.
- The Nile River data set, which is used to illustrate a number of code examples, is not a good example of a time series for AR(p) or NDLM models, as it exhibits long-term memory and requires a more sophisticated model like ARFIMA, which is very advanced and far beyond the scope of this course.
Anyhow, I try to mitigate these issues in my notes as best I can, but this will not be a substitute for a good time series course and may take a lot of time to complete. Perhaps I will revisit this once I complete the specialization and have a better understanding of the material.
One of the questions I had when I started this course was: what is the difference between a Bayesian approach to time series analysis and a classical approach? The following is a summary of what I found:
The Bayesian approach shows up primarily in:
- Sections on Bayesian inference where we do inference on the parameters of the models.
- Bayesian prediction, unlike an MLE prediction, is a distribution of predictions rather than just a point estimate, and is therefore useful for quantifying uncertainty.
- We also cover some material on model selection; this again is where the Bayesian approach provides more powerful tools than the classical approach.
- When we want to quantify the uncertainty in our model, we have four sources of uncertainty:
- Uncertainty about whether we are using the correct model (structure).
- I consider this an epistemic uncertainty.
- One could reduce it by collecting more data and then applying Bayesian model selection to choose the best model.
- Uncertainty due to the estimation of the model parameters. This is also an epistemic uncertainty: we can reduce it by collecting more data, which shrinks the plausible intervals for these parameters under the Bayesian approach.
- Uncertainty due to the random shocks \varepsilon_t in the period being predicted. This is an aleatory uncertainty.
- Uncertainty in the forecasted values X_{t+h}. The parameter and shock uncertainties (items 2 and 3) can be quantified using a plausible interval in the Bayesian approach, and as we predict further into the future the interval grows (see the sketch after this list).
- Returning to the first source, uncertainty about the model structure: model selection is a big part of the Bayesian approach, and we can use the DIC, WAIC, and LOO to compare models.
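To make the forecast-uncertainty point concrete, here is a minimal sketch in base R (my own illustration, not course code). It uses a classical MLE fit of a simulated AR(2) rather than a Bayesian posterior predictive, but it shows the same qualitative behaviour: the prediction intervals widen as the horizon grows.

```r
set.seed(42)
y <- arima.sim(model = list(ar = c(0.6, 0.2)), n = 200)  # simulate a stationary AR(2)

fit <- arima(y, order = c(2, 0, 0))  # fit an AR(2) by maximum likelihood
fc  <- predict(fit, n.ahead = 12)    # point forecasts and their standard errors

# Approximate 95% prediction intervals; fc$se grows with the horizon h,
# so the uncertainty in X_{t+h} accumulates as we predict further ahead.
data.frame(h        = 1:12,
           lower    = as.numeric(fc$pred - 1.96 * fc$se),
           forecast = as.numeric(fc$pred),
           upper    = as.numeric(fc$pred + 1.96 * fc$se))
```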
The book by Professor Prado is very comprehensive, covers plenty of additional models, and references a lot of recent research. These include VAR and VARMA models, Kalman filters, SMC/particle filters, etc., which are useful for the continuous-control flavours of RL. But you will need to learn them on your own.
In the capstone project, which is the next course in the specialization, the instructor adds another layer of sophistication by introducing mixtures of time series models.
However, unlike some courses I took, we do dive deep enough and get sufficient examples to understand how to put all the pieces together into more sophisticated time series models.
One issue is that no one has published notes on this course so far, and very few people have completed the course or the specialization compared to the number of people who finished the first course.
I found this course poorly structured, and I often felt it was wasting my time on trying to sort things out, searching for motivations, or looking material up in the textbooks or external sources.
To a large extent, much of the mathematics we learned in the previous three courses isn't relevant for this course. The AR(1) and AR(p) processes are autoregressive, which imposes a specific algebraic structure that we haven't seen before. The NDLMs seem to be based on regression, but in reality they extend the AR(p) process and thus have to make room to incorporate its autoregressive structure. A lot of the equations are quite different, and we need a number of techniques from numerical linear algebra and functional analysis. These are introduced in handouts or just referenced by name during the videos: Toeplitz matrices, the Durbin-Levinson recursion, the Yule-Walker equations, Wold's theorem, Fourier analysis, complex roots of characteristic polynomials, generalized inverses and their role as least squares and maximum likelihood estimators, the Newton-Raphson approximation, random walks, and Kalman filters. In reality many of these were vaguely familiar, but not necessarily in R or in matrix form. I noticed that most of them are also missing from the course textbooks and from the wider references recommended by the instructors of the previous courses in the specialization. The instructor does not state that they are prerequisites, nor that they are simply results she is using without explaining them.
I often had to read the books to make sense of the material.
- Prado, Ferreira, and West (2023) is comprehensive but more often than not less than useful, requiring jumping back and forth through the material in a way that only makes sense to someone who knows most of it already. It seems that either the course lacks a coherent structure or that the structure is simply too elusive for a student of this course to figure out. It also has lots and lots of material we don't need. What I needed is more like a tutorial with motivation for the main results.
- West and Harrison (2013) I found more useful, but also a big time waster in the sense that it can be needlessly metaphysical, often feels like self-promotion, and goes into endless detail about the more trivial cases. Also, the authors frequently send the reader to other references for details about non-DLM models, yet in this course we also cover AR(p). Much of the DLM approach is based on the superposition principle, which allows us to mix and match different models into a DLM, but the authors assume that the reader is already familiar with these concepts.
- So covering the material for the course takes far too much time. There are some important results that Prado references that are explained in the final chapter, but the notation is quite different, and I wasn't able to fill in the details of the maths she goes over, despite being convinced that the two key results about the NDLM that she presented are indeed correct.
Only once most of my notes were written could I start to see how this material comes together. However, I must have spent 5-10x more time than was required to complete the course, and I still need to review it once or twice more.
Mathematical Review
There is an issue with mathematics: most of the results and techniques are so rarely used that students will soon forget all but a few very useful results. Having a good memory is a great asset in mathematics, but it is rarely enough. I like to review some mathematical results from my undergraduate days every five years or so. This helps me keep many of the results fresh in my mind and also makes reading new mathematics easier. Fundamentals in mathematics can go a very long way. This is material from topology, determinants and solving linear equations, numerical methods for decomposing matrices, definitions of certain groups, and so on.
One reason this and other Bayesian courses and books can be challenging and even overwhelming is that they use lots of mathematics. This can range from high school material like complex numbers and the quadratic formula, to intermediate results like finding roots of characteristic polynomials, eigenvalues, Toeplitz matrices, and Jordan forms, to advanced topics like the Durbin-Levinson recursion and certain results from Fourier analysis and even functional analysis.
Note that I have not even touched on probability and statistics in that list.
Rather than complain, I see this as an opportunity to review and learn some mathematics and statistics that can be useful to a data scientist. During my last stint in data science I was often able to write formulas, but more often than not I felt I lacked sufficient mathematical tools to manipulate them to get the kind of results I wanted. Rather than learning lots of mathematics, I wanted to find the most practical and useful results for wrangling maths. When I was a physics undergraduate these might have been trigonometric identities, completing the square, familiarity with many integrals and with Taylor or Maclaurin series approximations, a few useful inequalities, and occasionally l'Hopital's rule. Familiarity with some ODEs was also greatly beneficial, as these come up in many physical models. Later on, Hermitian and unitary matrices, Fourier expansions, spectral theory, and some results from functional analysis were useful.
For statistics, variants of the law of large numbers and the central limit theorem, convergence theorems, manipulations of the normal distribution, and the linearity of expectation can get you a long way. But you have to remember lots of definitions, and there are lots of results and theorems that seem to be stepping stones to other results rather than of any practical use.
On the other hand, the conjugacy of certain distributions, as demonstrated by Herbert Lee and other instructors in this specialization, is often very challenging. Charts of the convergence of distributions to other distributions under certain conditions are neat but rarely come up. There are Hoeffding's inequality and Markov's inequality, which can be useful, but as with most results in mathematics I never had a clear sense of where they might be used. Then there are certain results like the convergence of Markov chains, doubly stochastic matrices, and De Finetti's theorem in statistics.
I have found that the more I learn the more I can understand and appreciate the material.
- The autoregressive process gives rise to matrices with diagonal bands, more specifically Toeplitz matrices, which can be solved using the Durbin-Levinson recursion mentioned many times in the course.
- The Durbin-Levinson recursion is an advanced topic not covered in the numerical analysis or algebra courses I took.
- To use it with time series we also need to understand the Yule-Walker equations (see the sketch after this list).
- AR(p) requires some linear algebra concepts like eigenvalues, eigenvectors, and characteristic polynomials over \mathbb{C}.
- The infinite-order moving average representation for AR(p) requires the Wold decomposition theorem, which is not a result I recall learning in my functional analysis course. We also use some complex numbers, Fourier analysis, and spectral density functions.
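As a concrete illustration of the Toeplitz structure and the Yule-Walker equations, here is a small sketch in base R (my own, not from the course). It solves the Yule-Walker system directly from the Toeplitz autocovariance matrix and compares the result with R's built-in Yule-Walker fit, which uses the Durbin-Levinson recursion internally.

```r
set.seed(1)
x <- arima.sim(model = list(ar = c(0.5, -0.3, 0.2)), n = 1000)  # simulate an AR(3)

p   <- 3
g   <- acf(x, lag.max = p, type = "covariance", plot = FALSE)$acf[, , 1]
Gam <- toeplitz(g[1:p])   # p x p Toeplitz matrix of autocovariances, lags 0..p-1
rhs <- g[2:(p + 1)]       # autocovariances at lags 1..p

phi_yw <- solve(Gam, rhs) # Yule-Walker estimates of phi_1, ..., phi_p
phi_yw
ar.yw(x, order.max = p, aic = FALSE)$ar  # built-in fit; should agree closely
```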
Below I summarize some of the extracurricular material I found useful in the course.
Complex Numbers (Review)
When we wish to find the roots of real-valued polynomials we will often encounter complex numbers. In this course such polynomials arise naturally as the characteristic polynomials of AR(p) processes.
We will need the polar form of complex numbers to represent some variants of the AR(p) process.
Complex numbers z \in \mathbb{C} are numbers that can be expressed in the form z = a + bi, where a, b \in \mathbb{R} and i is the imaginary unit, defined as the square root of -1. Complex numbers can be added, subtracted, multiplied, and divided just like real numbers.
The complex conjugate of a complex number z = a + bi is \bar{z} = a - bi. The magnitude of a complex number z = a + bi is |z| = \sqrt{a^2 + b^2}; this is sometimes called the modulus in this course. The argument of a complex number z = a + bi is \text{arg}(z) = \tan^{-1}(b/a), taking the quadrant into account. The polar form of a complex number is z = r e^{i \theta}, where r = |z| and \theta = \text{arg}(z).
The polar form of a complex number is given by:
\begin{aligned} z &= |z| e^{i \theta} \\ &= r (\cos\theta + i \sin\theta) \end{aligned} \tag{1}
where:
- |z| is the magnitude of the complex number, i.e. the distance from the origin to the point in the complex plane.
- \theta is the angle of the complex number.
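Base R handles complex arithmetic natively, so the polar form is easy to check numerically. A quick sketch (my own, not course code):

```r
z <- complex(real = 3, imaginary = 4)

Mod(z)   # |z| = 5, the modulus / magnitude
Arg(z)   # theta, the argument (a quadrant-aware arctangent)
Conj(z)  # 3 - 4i, the complex conjugate

# Reconstruct z from its polar form r * exp(i * theta):
Mod(z) * exp(1i * Arg(z))  # equals 3 + 4i up to floating-point error
```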
I think we will also need the unit roots.
Eigenvalues, Eigenvectors, Characteristic Polynomials, and Unit Roots
The Eigenvalues of a matrix are the roots of the characteristic polynomial of the matrix. The characteristic polynomial of a matrix A is defined as:
\begin{aligned} \text{det}(A - \lambda I) = 0 \end{aligned}
where \lambda is the Eigenvalue and I is the identity matrix. The eigenvectors of a matrix are the vectors that satisfy the equation:
\begin{aligned} A v = \lambda v \end{aligned}
where v is the eigenvector and \lambda is the eigenvalue. The eigenvalues and eigenvectors of a matrix are used in many applications in mathematics and physics, including the diagonalization of matrices, the solution of differential equations, and the analysis of dynamical systems.
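A quick numerical check of these definitions in R (my own sketch): eigen() returns the eigenvalues and eigenvectors, and we can verify that A v = \lambda v holds.

```r
A <- matrix(c(2, 1,
              1, 2), nrow = 2, byrow = TRUE)

e <- eigen(A)  # eigenvalues are the roots of det(A - lambda I) = 0
e$values       # 3 and 1 for this symmetric matrix

# Verify A v = lambda v for the first eigenpair (should be ~ the zero vector):
A %*% e$vectors[, 1] - e$values[1] * e$vectors[, 1]
```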
Unit Roots
A unit root is a root of the characteristic polynomial of an autoregressive model that is equal to 1 (more generally, a root that lies on the unit circle). The presence of a unit root in an autoregressive model indicates that the model is not stationary. A unit root test is a statistical test used to determine whether a time series is stationary or non-stationary: the null hypothesis is that the time series has a unit root, and the alternative hypothesis is that the time series is stationary. Unit root tests are an important tool in time series analysis.
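A minimal sketch in R of how this looks in practice (my own illustration, with hypothetical AR coefficients): for an AR(p) with coefficients \phi_1, \dots, \phi_p the AR polynomial is \phi(z) = 1 - \phi_1 z - \dots - \phi_p z^p, and the process is stationary when all roots lie outside the unit circle.

```r
phi   <- c(1.4, -0.45)          # hypothetical AR(2) coefficients
roots <- polyroot(c(1, -phi))   # roots of 1 - 1.4 z + 0.45 z^2
Mod(roots)                      # both moduli > 1 here, so the AR(2) is stationary

# A random walk has phi = 1, i.e. a unit root:
Mod(polyroot(c(1, -1)))         # exactly 1

# A formal unit-root test (augmented Dickey-Fuller), assuming the
# tseries package is installed:
# tseries::adf.test(some_series)
```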
Spectral analysis (1898)
The power spectrum of a signal is the squared absolute value of its Fourier transform. When it is estimated from the discrete Fourier transform it is also called the periodogram, and it is usually computed using a fast Fourier transform (FFT) algorithm.
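A minimal sketch in base R (my own): the periodogram of a noisy sinusoid, estimated with spec.pgram, which computes it via the FFT. The peak should land near the true frequency.

```r
set.seed(7)
n <- 256
t <- 1:n
x <- sin(2 * pi * 0.1 * t) + rnorm(n, sd = 0.5)  # signal at frequency 0.1

pg <- spec.pgram(x, taper = 0, detrend = FALSE, plot = FALSE)  # raw periodogram
pg$freq[which.max(pg$spec)]  # frequency of the peak, close to 0.1
```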
Kalman Filter (1960)
\begin{aligned} x_{t} & = F_{t} x_{t-1} + G_{t} u_{t} + w_{t} && \text{(transition equation)} \\ y_{t} & = H_{t} x_{t} + v_{t} && \text{(observation equation)} \end{aligned} \tag{2}
where:
- x_{t} is the state vector at time t,
- F_{t} is the state transition matrix,
- G_{t} is the control input matrix,
- u_{t} is the control vector,
- w_{t} is the process noise vector,
- y_{t} is the observation vector at time t,
- H_{t} is the observation matrix,
- v_{t} is the observation noise vector.
The Kalman filter is a recursive algorithm that estimates the state of a linear dynamic system from a series of noisy observations. It is based on a linear dynamical system model defined by two equations: the state transition equation, which describes how the state of the system evolves over time, and the observation equation, which describes how the observations are generated from the state. The filter uses these two equations to estimate the state at each time step, based on the observations received up to that time step. It could be implemented in real time in the 1960s and was used in the Apollo missions.
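The filtering recursions are short enough to write out by hand. Below is a minimal base-R sketch (my own, not the course's NDLM notation or code) for the simplest special case of Equation 2: a scalar random-walk state with F_t = H_t = 1, no control input, and noise variances W and V (the "local level" model).

```r
# Kalman filter for the local level model:
#   x_t = x_{t-1} + w_t,  w_t ~ N(0, W)
#   y_t = x_t     + v_t,  v_t ~ N(0, V)
kalman_local_level <- function(y, V, W, m0 = 0, C0 = 1e7) {
  n <- length(y)
  m <- numeric(n)  # filtered state means
  C <- numeric(n)  # filtered state variances
  m_prev <- m0
  C_prev <- C0
  for (t in seq_len(n)) {
    a <- m_prev                  # predicted state mean
    R <- C_prev + W              # predicted state variance
    Q <- R + V                   # one-step forecast variance of y_t
    K <- R / Q                   # Kalman gain
    m[t] <- a + K * (y[t] - a)   # update with observation y_t
    C[t] <- (1 - K) * R
    m_prev <- m[t]
    C_prev <- C[t]
  }
  list(m = m, C = C)
}

# Example: track a noisy level shift.
set.seed(3)
y  <- c(rnorm(50, mean = 0, sd = 1), rnorm(50, mean = 5, sd = 1))
kf <- kalman_local_level(y, V = 1, W = 0.1)
tail(kf$m)  # the filtered mean settles near 5 after the shift
```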
The Extended Kalman Filter (EKF) is an extension of the Kalman filter that can be used to estimate the state of a nonlinear dynamic system. The EKF linearizes the nonlinear system model at each time step and then applies the Kalman filter to the linearized system. The EKF is an approximation to the true nonlinear system, and its accuracy depends on how well the linearized system approximates the true system.
Box-Jenkins Method (1970)
A five-step process for identifying, selecting, and assessing ARMA (and similar) models.
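A minimal sketch of the core identify/estimate/check loop in base R (my own illustration, on simulated data):

```r
set.seed(11)
x <- arima.sim(model = list(ar = 0.7, ma = 0.4), n = 500)  # simulate an ARMA(1,1)

# 1. Identification: inspect the ACF/PACF to guess plausible orders.
acf(x)
pacf(x)

# 2. Estimation: fit a candidate ARMA(1,1) by maximum likelihood.
fit <- arima(x, order = c(1, 0, 1))

# 3. Diagnostic checking: residuals should look like white noise.
Box.test(residuals(fit), lag = 10, type = "Ljung-Box", fitdf = 2)
```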
- There are three courses on Stochastic Processes on MIT OCW that I found useful:
- Introduction to Stochastic Processes
- Discrete Stochastic Processes
- has lecture videos and notes
- Poisson processes
- Advanced Stochastic Processes
- Martingales
- Itô calculus