---
date: 2024-11-07
title: "Bayesian Inference in the NDLM: Part 1 M3L7"
subtitle: Time Series Analysis
description: "Normal Dynamic Linear Models (NDLMs) are a class of models used for time series analysis that allow for flexible modeling of temporal dependencies."
categories:
- Bayesian Statistics
- Time Series
keywords:
- Time Series
- Filtering
- Kalman filtering
- Smoothing
- NDLM
- Normal Dynamic Linear Models
- Polynomial Trend Models
- Regression Models
- Superposition Principle
- R code
fig-caption: Notes about ... Bayesian Statistics
title-block-banner: images/banner_deep.jpg
---
:::: {.content-hidden when-format="pdf"}
::: {#vid-sean-law .column-margin}
{{< video https://www.youtube.com/watch?v=0O6dlq6a4rA&ab_channel=SciPy >}}
Sean Law - STUMPY: Modern Time Series Analysis with Matrix Profiles
:::
::::
We now turn to Bayesian inference for an NDLM in which both the observational variance and the system variance are known. We discuss the filtering equations, the smoothing equations, and forecasting in this setting using a Bayesian approach.
::: {.callout-note collapse='false'}
### Reality check {.unnumbered}
1. This lesson is one of several dedicated to Bayesian inference on the NDLM. Professor Prado doesn't outline this apparently eclectic collection of NDLMs or explain the NDLM framework upfront. That doesn't mean there isn't a method to the madness, but it is left to the student to sort things out.
2. I can say that based on the three prior courses we would follow a similar methodology:
(a) Pick an **[inductive bias](https://en.wikipedia.org/wiki/Inductive_bias)** and use it to write down some models,
(b) Set a prior
(c) Use Bayes + the data to infer the form of the posterior
(d) Use an algorithm, possibly a Gibbs sampler to approximate the posterior
(e) Recover the parameters by marginalizing the posterior.
(f) ~~Validate convergence of the Algorithm.~~
(g) Validate via prior predictive and posterior predictive checks
(h) Optimize hyperparameters using AIC/BIC
(i) Select the best model via DIC.
Working with AR($p$) loosely followed this process, so I expected the NDLM would be the same. In the NDLM framework the algorithm is the Kalman filter, which originally had a Bayesian formulation. The Kalman filter lets us do filtering and smoothing to infer the system parameters we call $\theta_t$: it uses a set of recursive equations to update the state and returns the estimates together with an uncertainty estimate! The NDLM framework is mostly scaffolding to handle the requirements of the Kalman filter (~~discretization of continuous time equations~~, assembling the different parts into a single system evolution matrix, converting them into the representation a KF needs, combining the different sources of uncertainty, i.e. $v$ and $w$, into just one of each per overall model, linearity, ~~controllability~~, and setting the boundary conditions from the priors).
But I did not know all this. Initially I expected that once we chose a (normal) prior $\mathcal{N}(m_0,\mathbf{C}_0)$ we would infer $F,G,V,W$ from the data. I suppose that no one has been able to write a prior for $F$ and $G$; even Kalman in his paper [@kalman1960new] does not describe how to work out the linear system $G$.
While it is never stated explicitly, our choices of $F$ and $G$ are very much modeling decisions. $F$ maps the state that $G$ evolves onto the observation, so the dimensions must be compatible in two senses:
1. $F_t$ must have the same number of rows as $\theta_t$, so that $F_t' \theta_t$ is defined, and
2. $F_t$ must have as many columns as the dimension of the observation $y_t$ (a single column in the univariate case).
$G$ is the hidden state's evolution matrix. In the NDLM we don't know very much about the data-generating process. Unlike the usual setting for the Kalman filter, we don't have access to a dynamical system; all we have is a time series, and as modelers we need to construct a surrogate state-space model from fairly vague prior assumptions, likely informed by some exploration of the series. (This is another problem that could be mitigated via model selection, though choosing among too many candidate models introduces its own multiple-comparisons issue.)
But that is not how things proceed here. In fact, we don't seem to be particularly Bayesian when constructing the NDLM itself.
Similar to constructing a neural network, we pick an architecture $\sum_i \{\mathbf{F}_{i,t},\mathbf{G}_{i,t},\cdot,\cdot\}$
with a hyperparameter such as a polynomial degree $q$, an AR($p$) order, or the number of Fourier components $r$, and we hand it to the Kalman filter, which returns an estimate of $\boldsymbol{\theta}_t$ for us. But that still leaves $\nu_t$ and $\omega_t$.
Now, as far as Kalman filters go, $\nu_t$ and $\omega_t$ are errors inherent in the system that cannot be eliminated. One consequence is that our estimates of $\theta_t$ will never converge to the actual state, though they can reach optimality. So as far as the KF is concerned, $\nu_t$ and $\omega_t$ are known inputs to the inference tasks (i.e. filtering and smoothing of the parameters $\theta_t$ given the data, and forecasting $y_t$ based on that).
I was initially disappointed that we assume we know everything I wanted to infer. This is an open question for the proverbial [Feynman Notebook](./C4-L06.qmd). In reality estimating these quantities is just another problem, but it is not the process followed here for time series analysis.
However, at some point I saw a talk about [STUMPY](https://stumpy.readthedocs.io/en/latest/Tutorial_STUMPY_Basics.html), which does many cool time series things in Python. The speaker, Sean Law, talks about all the different Data Science & DSP mojos `#magicspells` one can use, and the issues that arise when comparing lots of time series, or even a single time series with lots of data. He also makes the point that there is `#NoFreeLunch`: each mojo comes with its own assumptions and limitations, before making the case for using [Matrix Profiles](https://stumpy.readthedocs.io/en/latest/Tutorial_STUMPY_Basics.html).
Prado lays the groundwork for working with a time series of around 300 data points, i.e. something we can still plot on the screen and inspect. This might unlock for us the NDLM framework, which, while flexible, requires us to specify the detailed form of the model and its priors. That is easy enough with a synthetic data set, less so if we need to analyze a novel time series for which we lack intuition. There we will need to do a preliminary exploratory data analysis before we can step in and construct our NDLM.
I suppose the expectation is that we will use other, non-Bayesian, tools to do this, as it isn't really covered in the course.
Splitting a time series into a trend, periodic components, and a stationary residual isn't too tricky in R; recovering the dominant periods is then possible, and so is doing a regression on the residuals. Once we have done all that, we should be able to write down an NDLM from what we learned, by superposing the corresponding components (see the sketch below), and that will let us do filtering, smoothing, and forecasting with error estimates.
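To make the superposition step concrete, here is a minimal sketch, assuming the `dlm` package; the components, the variances, and the covariate `x` below are all made up for illustration and are not from the course.

```{r}
#| label: lst-superposition-sketch
library(dlm)

# Illustrative only: a local linear trend, one annual harmonic, and a dynamic
# regression on a made-up covariate x, combined by superposition.
x <- rnorm(120)  # hypothetical external covariate

model <- dlmModPoly(order = 2, dV = 1, dW = c(0, 0.5)) +  # polynomial trend
  dlmModTrig(s = 12, q = 1, dV = 0) +                     # one Fourier harmonic, period 12
  dlmModReg(x, addInt = FALSE, dV = 0)                    # regression component

# The superposition stacks F, block-diagonalizes G and W, and sums the
# observational variances (hence dV = 0 for the added components):
FF(model)  # combined F (the regression column is time-varying via JFF/X)
GG(model)  # block-diagonal evolution matrix
W(model)   # combined system covariance
V(model)   # single observational variance
```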
P.S. This is a long rant; once it's done I think it can be moved to the course intro as an overview of the NDLM part.
:::
## Filtering :movie_camera:
<!-- margin slide figures: derivation for the prior and forecast at time t; derivation of the posterior at time t -->
Recall we are working in a Bayesian setting where an NDLM with a normal prior looks like this:
$$
\begin{aligned}
y_t &= F_t' \theta_t + \nu_t & \nu_t &\sim \mathcal{N}(0, v_t) & \text{(observation)} \\
\theta_t &= G_t \theta_{t-1} + \omega_t & \omega_t &\sim \mathcal{N}(0, W_t) & \text{(evolution)} \\
& &(\theta_0 \mid \mathcal{D}_0) & \sim \mathcal{N}(m_0, \mathbf{C}_0) & \text{(prior)}
\end{aligned}
$$ {#eq-generic-NDLM}
We are assuming:

- $(\theta_0 \mid \mathcal{D}_0) \sim \mathcal{N}(m_0, \mathbf{C}_0)$, where $\mathbf{C}_0$ is a variance-covariance **matrix**;
- $\mathcal{D}_0$ denotes the *information set* we have before collecting any data;
- $\mathbf{F}_t, \mathbf{G}_t, m_0, \mathbf{C}_0, v_t, \mathbf{W}_t$ are known for all $t$.

Filtering is a sequential, forward-in-time analysis: we update our beliefs about $\theta_t$ as each new observation arrives.
However, there is often great interest in looking back in time in order to get a clearer picture of what happened.
We are interested in performing Bayesian inference in this setting and we talked about different kinds of distributions.
- One is the **filtering distribution** that allows us to update the distribution of $\theta_t$ as we receive observations and information over time.
- The other is the **smoothing equations**, which allow us to revisit the past once we have observed a chunk of data.
In a Bayesian setting, you have to set *a prior distribution*. [We will work with the prior distribution that is conjugate.]{.mark}
In this case we begin with a distribution at time zero for $\theta_0$: before we have seen any data at all, we have this prior distribution.
To derive the filtering distributions we assume that the posterior at time $t-1$ has the form:
$$
(\theta_{t-1} \mid \mathcal{D}_{t-1}) \sim \mathcal{N}(m_{t-1}, C_{t-1}).
$$ {#eq-filtering-assumption-1}
We assume the posterior at $t-1$ follows this normal distribution because of
1. the normal prior in [@eq-generic-NDLM], which is conjugate for this model, and
2. the linearity of the model in [@eq-generic-NDLM].
Together these ensure that the updates to the model parameters and their uncertainty at each time step preserve the normal structure of the prior, so the same form holds at every $t$.
Then, we can obtain the following distributions:
1. Prior at Time $t$
$$
(\theta_t \mid \mathcal{D}_{t-1}) \sim \mathcal{N}(a_t, R_t) \qquad \text{(prior at time t)} \qquad
$$ {#eq-prior-at-time-t}
with
$$ \begin{aligned}
a_t &\doteq \mathbb{E}[\theta_t \mid \mathcal{D}_{t-1}] = G_t \, \mathbb{E}[\theta_{t-1} \mid \mathcal{D}_{t-1}] = G_t m_{t-1}, \\
R_t &\doteq \mathbb{V}ar[\theta_t \mid \mathcal{D}_{t-1}] = G_t \, \mathbb{V}ar[\theta_{t-1} \mid \mathcal{D}_{t-1}] \, G_t' + W_t = G_t C_{t-1} G_t' + W_t.
\end{aligned}
$$ {#eq-derivation-for-a-t-and-R-t}
Here we simply took the first and second moments of the system equation in [@eq-generic-NDLM], conditioned on our information set $\mathcal{D}_{t-1}$.
2. One-Step Forecast
$$
(y_t \mid \mathcal{D}_{t-1}) \sim \mathcal{N}(f_t, q_t) \quad \text{(one step forecast fn)} \qquad
$$ {#eq-one-step-forecast-fn}
with:
$$
\begin{aligned}
f_t &\doteq \mathbb{E}[ y_t \mid \mathcal{D}_{t-1} ] \\
&= F_t' \; \mathbb{E}[ \theta_t \mid \mathcal{D}_{t-1} ] \\
&= F_t' \; a_t \\
\end{aligned}
$$ {#eq-derivation-for-f-t}
$$
\begin{aligned}
q_t & \doteq \mathbb{V}ar[y_t \mid \mathcal{D}_{t-1}] \\
&= F_t' \; \mathbb{V}ar[\theta_t \mid \mathcal{D}_{t-1}] \; F_t + v_t \\
&= F_t' R_t F_t + v_t
\end{aligned}
$$ {#eq-derivation-for-q-t}
Here we took the first and second moments of the observation equation conditioned on the information set $\mathcal{D}_{t-1}$ and substituted the prior moments $a_t$ and $R_t$ from @eq-prior-at-time-t.
3. Posterior at Time $t$
$$
(\theta_t \mid \mathcal{D}_t) \sim \mathcal{N}(m_t, C_t)
$$
with
$$\begin{aligned}
m_t &= a_t + R_t F_t q_t^{-1} (y_t - f_t), \\
C_t &= R_t - R_t F_t q_t^{-1} F_t' R_t.
\end{aligned}
$$ {#eq-posterior-at-time-t}
These can be derived via Normal theory or via the Multivariate Bayes' theorem. The background for both seems to be provided in [@west2013bayesian §17.2.3 p.639]
Following [@west2013bayesian §17.2.3], drop the time subscripts and write $a = a_t$, $R = R_t$, $F = F_t$, $V = v_t$. The observation equation together with the prior at time $t$ implies the joint distribution
$$
\begin{pmatrix}Y \\ \theta\end{pmatrix} \sim \mathcal{N}
\left(
\begin{pmatrix}F'a \\ a \end{pmatrix},
\begin{pmatrix} F'RF + V & F'R \\ RF & R \end{pmatrix}
\right)
$$
Therefore, identifying $Y$ with $X_1$ and $\theta$ with $X_2$ in the partition of $X$ in §17.2.2, we have
$$
Y \sim \mathcal{N}(F'a, \; F'RF + V)
$$
$$
(\theta \mid Y) \sim \mathcal{N}(m, C),
$$
where
$$
m = a + RF\,[F'RF + V]^{-1}\,[Y - F'a]
$$
and
$$
C = R - RF\,[F'RF + V]^{-1}\,F'R.
$$
Now, denoting $e_t = (y_t - f_t)$ and $A_t = R_t F_t q_t^{-1}$, we can rewrite the posterior at time $t$ as:
$$
\begin{aligned}
\theta \mid \mathcal{D}_t &\sim \mathcal{N}(m_t,C_t)\\
m_t &\doteq a_t + A_t e_t, \\
C_t &\doteq R_t - A_t q_t A_t'
\end{aligned}
$$ {#eq-posterior-distribution}
@eq-derivation-for-a-t-and-R-t, @eq-derivation-for-f-t, @eq-derivation-for-q-t and @eq-posterior-distribution are often referred to as the [Kalman filtering](https://en.wikipedia.org/wiki/Kalman_filter) equations.
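To see the recursions in action, here is a minimal numeric sketch of a single filtering step for the scalar first order polynomial (local level) model, with $F_t = G_t = 1$; all the numbers are illustrative placeholders, not course data.

```{r}
#| label: lst-one-step-kalman-sketch
# One step of the Kalman filtering equations for a scalar local level model.
# All inputs below are made-up placeholders.
m_prev <- 579;  C_prev <- 2          # posterior at t-1: (theta_{t-1} | D_{t-1}) ~ N(m_prev, C_prev)
Gt <- 1; Ft <- 1; Wt <- 1; Vt <- 1   # known structure and variances
y_t <- 580.5                         # new observation

a_t <- Gt * m_prev             # prior mean:         a_t = G_t m_{t-1}
R_t <- Gt * C_prev * Gt + Wt   # prior variance:     R_t = G_t C_{t-1} G_t' + W_t
f_t <- Ft * a_t                # forecast mean:      f_t = F_t' a_t
q_t <- Ft * R_t * Ft + Vt      # forecast variance:  q_t = F_t' R_t F_t + v_t
A_t <- R_t * Ft / q_t          # adaptive (Kalman gain) coefficient
e_t <- y_t - f_t               # one-step forecast error
m_t <- a_t + A_t * e_t         # posterior mean:     m_t = a_t + A_t e_t
C_t <- R_t - A_t * q_t * A_t   # posterior variance: C_t = R_t - A_t q_t A_t'
round(c(a_t = a_t, R_t = R_t, f_t = f_t, q_t = q_t, m_t = m_t, C_t = C_t), 3)
```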
::: {.callout-note collapse="true"}
## Video Transcript
{{< include transcripts/_C4-L03-T05.qmd >}}
:::
## Summary of filtering distributions :spiral_notepad:
### Bayesian Inference in NDLM: Known Variances
Consider an NDLM given by:
$$
\begin{aligned}
y_t &= F_t' \theta_t + \nu_t, \quad \nu_t \sim \mathcal{N}(0, v_t), \\
\theta_t &= G_t \theta_{t-1} + \omega_t, \quad \omega_t \sim \mathcal{N}(0, W_t),
\end{aligned}
$$ {#eq-NDLM-known-variances}
with $F_t$, $G_t$, $v_t$, and $W_t$ known. We also assume a prior distribution of the form $(\theta_0 \mid \mathcal{D}_0) \sim \mathcal{N}(m_0, C_0)$, with $m_0$, $C_0$ known.
#### Filtering
We are interested in finding $\mathbb{P}r(\theta_t \mid \mathcal{D}_t)$ for all $t$. Assume that the posterior at $t-1$ is such that:
$$
(\theta_{t-1} \mid \mathcal{D}_{t-1}) \sim \mathcal{N}(m_{t-1}, C_{t-1}).
$$ {#eq-filtering}
Then, we can obtain the following:
1. Prior at Time $t$
$$
(\theta_t \mid \mathcal{D}_{t-1}) \sim \mathcal{N}(a_t, R_t),
$$
with
$$
a_t = G_t m_{t-1} \qquad R_t = G_t C_{t-1} G_t' + W_t.
$$
2. One-Step Forecast
$$
(y_t \mid D_{t-1}) \sim \mathcal{N}(f_t, q_t),
$$
with
$$
f_t = F_t' a_t, \quad q_t = F_t' R_t F_t + v_t.
$$
3. Posterior at Time $t$: $(\theta_t \mid \mathcal{D}_t) \sim \mathcal{N}(m_t, C_t)$, with
$$\begin{aligned}
m_t &= a_t + R_t F_t q_t^{-1} (y_t - f_t), \\
C_t &= R_t - R_t F_t q_t^{-1} F_t' R_t.
\end{aligned}
$$
Now, denoting $e_t = (y_t - f_t)$ and $A_t = R_t F_t q_t^{-1}$, we can rewrite the equations above as:
$$\begin{aligned}
m_t &= a_t + A_t e_t, \\
C_t &= R_t - A_t q_t A_t'
\end{aligned}
$$
## Code: Filtering in the NDLM, Example :spiral_notepad: $\mathcal{R}$
```{r}
#| label: lst-filtering-in-the-NDLM
#################################################
##### Univariate DLM: Known, constant variances
#################################################
set_up_dlm_matrices <- function(FF, GG, VV, WW){
return(list(FF=FF, GG=GG, VV=VV, WW=WW))
}
set_up_initial_states <- function(m0, C0){
return(list(m0=m0, C0=C0))
}
### forward update equations ###
forward_filter <- function(data, matrices, initial_states){
## retrieve dataset
y_t <- data$y_t
T <- length(y_t)
## retrieve a set of quadruples
# FF, GG, VV, WW are scalar
FF <- matrices$FF
GG <- matrices$GG
VV <- matrices$VV
WW <- matrices$WW
## retrieve initial states
m0 <- initial_states$m0
C0 <- initial_states$C0
## create placeholder for results
d <- dim(GG)[1]
at <- matrix(NA, nrow=T, ncol=d)
Rt <- array(NA, dim=c(d, d, T))
ft <- numeric(T)
Qt <- numeric(T)
mt <- matrix(NA, nrow=T, ncol=d)
Ct <- array(NA, dim=c(d, d, T))
et <- numeric(T)
for(i in 1:T){
# moments of priors at t
if(i == 1){
at[i, ] <- GG %*% t(m0)
Rt[, , i] <- GG %*% C0 %*% t(GG) + WW
Rt[, , i] <- 0.5*Rt[, , i]+0.5*t(Rt[, , i])
}else{
at[i, ] <- GG %*% t(mt[i-1, , drop=FALSE])
Rt[, , i] <- GG %*% Ct[, , i-1] %*% t(GG) + WW
Rt[, , i] <- 0.5*Rt[, , i]+0.5*t(Rt[, , i])
}
# moments of one-step forecast:
ft[i] <- t(FF) %*% (at[i, ])
Qt[i] <- t(FF) %*% Rt[, , i] %*% FF + VV
# moments of posterior at t:
At <- Rt[, , i] %*% FF / Qt[i]
et[i] <- y_t[i] - ft[i]
mt[i, ] <- at[i, ] + t(At) * et[i]
Ct[, , i] <- Rt[, , i] - Qt[i] * At %*% t(At)
Ct[, , i] <- 0.5*Ct[, , i]+ 0.5*t(Ct[, , i])
}
cat("Forward filtering is completed!") # indicator of completion
return(list(mt = mt, Ct = Ct, at = at, Rt =
Rt, ft = ft, Qt = Qt))
}
forecast_function <- function(posterior_states, k, matrices){
## retrieve matrices
FF <- matrices$FF
GG <- matrices$GG
WW <- matrices$WW
VV <- matrices$VV
mt <- posterior_states$mt
Ct <- posterior_states$Ct
## set up matrices
T <- dim(mt)[1] # time points
d <- dim(mt)[2] # dimension of state-space parameter vector
## placeholder for results
at <- matrix(NA, nrow = k, ncol = d)
Rt <- array(NA, dim=c(d, d, k))
ft <- numeric(k)
Qt <- numeric(k)
for(i in 1:k){
## moments of state distribution
if(i == 1){
at[i, ] <- GG %*% t(mt[T, , drop=FALSE])
Rt[, , i] <- GG %*% Ct[, , T] %*% t(GG) + WW
Rt[, , i] <- 0.5*Rt[, , i]+0.5*t(Rt[, , i])
}else{
at[i, ] <- GG %*% t(at[i-1, , drop=FALSE])
Rt[, , i] <- GG %*% Rt[, , i-1] %*% t(GG) + WW
Rt[, , i] <- 0.5*Rt[, , i]+0.5*t(Rt[, , i])
}
## moments of forecast distribution
ft[i] <- t(FF) %*% t(at[i, , drop=FALSE])
Qt[i] <- t(FF) %*% Rt[, , i] %*% FF + VV
}
cat("Forecasting is completed!") # indicator of completion
return(list(at=at, Rt=Rt, ft=ft, Qt=Qt))
}
## obtain 95% credible interval
get_credible_interval <- function(mu, sigma2,
quantile = c(0.025, 0.975)){
z_quantile <- qnorm(quantile)
bound <- matrix(0, nrow=length(mu), ncol=2)
bound[, 1] <- mu +
z_quantile[1]*sqrt(as.numeric(sigma2)) # lower bound
bound[, 2] <- mu +
z_quantile[2]*sqrt(as.numeric(sigma2)) # upper bound
return(bound)
}
####################### Example: Lake Huron Data ######################
plot(LakeHuron,main="Lake Huron Data",
ylab="level in feet") # Total of 98 observations
k=4
T=length(LakeHuron)-k # We take the first 94 observations
# only as our data
ts_data=LakeHuron[1:T]
ts_validation_data <- LakeHuron[(T+1):98]
data <- list(y_t = ts_data)
# First order polynomial model
## set up the DLM matrices
FF <- as.matrix(1)
GG <- as.matrix(1)
VV <- as.matrix(1)
WW <- as.matrix(1)
m0 <- as.matrix(570)
C0 <- as.matrix(1e4)
## wrap up all matrices and initial values
matrices <- set_up_dlm_matrices(FF, GG, VV, WW)
initial_states <- set_up_initial_states(m0, C0)
## filtering
results_filtered <- forward_filter(data, matrices,
initial_states)
names(results_filtered)
ci_filtered <- get_credible_interval(results_filtered$mt,
results_filtered$Ct)
## forecasting
results_forecast <- forecast_function(results_filtered,k,
matrices)
ci_forecast <- get_credible_interval(results_forecast$ft,
results_forecast$Qt)
index=seq(1875, 1972, length.out = length(LakeHuron))
index_filt=index[1:T]
index_forecast=index[(T+1):98]
plot(index, LakeHuron, ylab = "level",
main = "Lake Huron Level",type='l',
xlab="time",lty=3,ylim=c(574,584))
points(index,LakeHuron,pch=20)
lines(index_filt, results_filtered$mt, type='l',
col='red',lwd=2)
lines(index_filt, ci_filtered[, 1], type='l',
col='red', lty=2)
lines(index_filt, ci_filtered[, 2], type='l', col='red', lty=2)
lines(index_forecast, results_forecast$ft, type='l',
col='green',lwd=2)
lines(index_forecast, ci_forecast[, 1], type='l',
col='green', lty=2)
lines(index_forecast, ci_forecast[, 2], type='l',
col='green', lty=2)
legend('bottomleft', legend=c("filtered","forecast"),
col = c("red", "green"), lty=c(1, 1))
#Now consider a 100 times smaller signal to noise ratio
VV <- as.matrix(1)
WW <- as.matrix(0.01)
matrices_2 <- set_up_dlm_matrices(FF,GG, VV, WW)
## filtering
results_filtered_2 <- forward_filter(data, matrices_2,
initial_states)
ci_filtered_2 <- get_credible_interval(results_filtered_2$mt,
results_filtered_2$Ct)
results_forecast_2 <- forecast_function(results_filtered_2,
length(ts_validation_data),
matrices_2)
ci_forecast_2 <- get_credible_interval(results_forecast_2$ft,
results_forecast_2$Qt)
plot(index, LakeHuron, ylab = "level",
main = "Lake Huron Level",type='l',
xlab="time",lty=3,ylim=c(574,584))
points(index,LakeHuron,pch=20)
lines(index_filt, results_filtered_2$mt, type='l',
col='magenta',lwd=2)
lines(index_filt, ci_filtered_2[, 1], type='l',
col='magenta', lty=2)
lines(index_filt, ci_filtered_2[, 2], type='l',
col='magenta', lty=2)
lines(index_forecast, results_forecast_2$ft, type='l',
col='green',lwd=2)
lines(index_forecast, ci_forecast_2[, 1], type='l',
col='green', lty=2)
lines(index_forecast, ci_forecast_2[, 2], type='l',
col='green', lty=2)
legend('bottomleft', legend=c("filtered","forecast"),
col = c("magenta", "green"), lty=c(1, 1))
plot(index_filt,results_filtered$mt,type='l',col='red',lwd=2,
ylim=c(574,584),ylab="level")
lines(index_filt,results_filtered_2$mt,col='magenta',lwd=2)
points(index,LakeHuron,pch=20)
lines(index,LakeHuron,lty=2)
```
## Smoothing and forecasting :movie_camera:
<!-- margin slide figures: smoothing; forecasting -->
We now discuss the **smoothing equations** for the case of the NDLM, where we are assuming that the variance at the *observation level* $\nu_t$ and the covariance matrix at the *system level* $\mathbf{W}_t$ are both known.
$$ \begin{aligned}
y_t &= \mathbf{F}_t' \mathbf{\theta}_t + \nu_t, &\nu_t &\sim \mathcal{N} (0, v_t), & \text{(observation)} \\
\mathbf{\theta}_t & = \mathbf{G}_t \mathbf{\theta}_{t-1} + \mathbf{\omega}_t, &\mathbf{\omega}_t & \sim \mathcal{N} (0, \mathbf{W}_t), & \text{(evolution)} \\
&\{ \mathbf{F}_t, \mathbf{G}_t, v_t, \mathbf{W}_t \} &(\mathbf{\theta}_0 \mid \mathcal{D}_0) & \sim \mathcal{N}(\mathbf{m}_0, \mathbf{C}_0) & \text{(prior)}
\end{aligned}
$$ {#eq-inference-NDLM}
with $F_t$, $G_t$, $v_t$, $W_t$, $m_0$ and $C_0$ known.
We have already discussed the filtering equations, i.e. the process for obtaining the distributions of $\theta_t \mid \mathcal{D}_t$ as we collect observations over time.
We do this by updating the distribution of $\theta_t$ step by step as we move forward in time, starting from the prior distribution.
Now we discuss smoothing, where we revisit the distributions of $\theta_t$ after having observed the full set of data $\mathcal{D}_T$.
#### Smoothing
For $t < T$, we have that:
$$
(\theta_t \mid D_T) \sim \mathcal{N}(a_T(t - T), R_T(t - T)),
$$
where
$$
a_T(t - T) = m_t - B_t [a_{t+1} - a_T(t - T + 1)],
$$
$$
R_T(t - T) = C_t - B_t [R_{t+1} - R_T(t - T + 1)] B_t',
$$
for $t = (T - 1), (T - 2), \dots, 0$, with $B_t = C_t G_t' R_{t+1}^{-1}$, and $a_T(0) = m_T$, $R_T(0) = C_T$. Here $a_t$, $m_t$, $R_t$, and $C_t$ are obtained using the filtering equations as explained before.
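As a quick illustration of the backward recursion, here is a minimal scalar sketch of a single smoothing step with $G = 1$; the filtered and smoothed moments used below are made-up placeholders, not course output.

```{r}
#| label: lst-one-step-smoothing-sketch
# One backward step of the smoothing recursions, scalar case with G = 1.
# All moments below are illustrative placeholders.
m_t  <- 580.1; C_t  <- 0.75   # filtered posterior at t:  (theta_t     | D_t)
a_t1 <- 580.1; R_t1 <- 1.75   # prior at t+1:             (theta_{t+1} | D_t)
s_t1 <- 579.8; S_t1 <- 0.60   # smoothed moments at t+1:  (theta_{t+1} | D_T)

B_t <- C_t * 1 / R_t1                    # B_t = C_t G' R_{t+1}^{-1}
s_t <- m_t - B_t * (a_t1 - s_t1)         # a_T(t-T) = m_t - B_t [a_{t+1} - a_T(t-T+1)]
S_t <- C_t - B_t * (R_t1 - S_t1) * B_t   # R_T(t-T) = C_t - B_t [R_{t+1} - R_T(t-T+1)] B_t'
round(c(B_t = B_t, smoothed_mean = s_t, smoothed_var = S_t), 3)
```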
#### Forecasting
For $h \geq 0$, it is possible to show that:
$$
(\theta_{t+h} \mid D_t) \sim \mathcal{N}(a_t(h), R_t(h)),
$$
$$
(y_{t+h} \mid D_t) \sim \mathcal{N}(f_t(h), q_t(h)),
$$
with
$$
a_t(h) = G_{t+h} a_t(h - 1),
$$
$$
R_t(h) = G_{t+h} R_t(h - 1) G_{t+h}' + W_{t+h},
$$
$$
f_t(h) = F_{t+h}' a_t(h),
$$
$$
q_t(h) = F_{t+h}' R_t(h) F_{t+h} + v_{t+h},
$$
and
$$
a_t(0) = m_t, \quad R_t(0) = C_t.
$$
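As a sanity check of these recursions: for the first order polynomial (local level) model with $F_t = G_t = 1$ and constant $v$ and $W$, they collapse to a flat forecast mean with a variance that grows linearly in $h$. The sketch below uses made-up filtered moments.

```{r}
#| label: lst-h-step-forecast-sketch
# h-step-ahead forecasting for the local level model (F = G = 1, constant V, W):
# a_t(h) = m_t and R_t(h) = C_t + h*W, so f_t(h) = m_t and q_t(h) = C_t + h*W + V.
m_t <- 580.1; C_t <- 0.75   # filtered moments at the forecast origin (illustrative)
V <- 1; W <- 1
h <- 1:4
f_h <- rep(m_t, length(h))  # flat forecast mean
q_h <- C_t + h * W + V      # forecast variance grows linearly in h
rbind(h = h, f_t_h = f_h, q_t_h = q_h)
```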
::: {.callout-note collapse="true"}
## Video Transcript
{{< include transcripts/_C4-L03-T06.qmd >}}
:::
## Summary of the smoothing and forecasting distributions :spiral_notepad: {#sec-summary-smoothing-forecasting}
<!-- start -->
### Bayesian Inference in NDLM: Known Variances {#sec-Bayesian-inference-NDLM-known-variances}
Consider the NDLM given by:
$$ \begin{aligned}
y_t &= \mathbf{F}_t' \mathbf{\theta}_t + \nu_t, &\nu_t &\sim \mathcal{N} (0, v_t), \\
\mathbf{\theta}_t &= \mathbf{G}_t \mathbf{\theta}_{t-1} + \mathbf{\omega}_t, &\mathbf{\omega}_t &\sim \mathcal{N} (0, \mathbf{W}_t), \\
&\{\textcolor{green}{\mathbf{F}_t, \mathbf{G}_t, v_t, \mathbf{W}_t} \} &(\mathbf{\theta}_0 \mid \mathcal{D}_0) &\sim \mathcal{N}(\textcolor{green}{\mathbf{m}_0, \mathbf{C}_0})
\end{aligned}
$$ {#eq-inference-NDLM-summary}
with $F_t$, $G_t$, $v_t$, and $W_t$ known.
We also assume a prior distribution of the form
$(\theta_0 \mid D_0) \sim \mathcal{N}(m_0, C_0)$, with $m_0$ and $C_0$ known.
Let's reflect for a moment on the assumptions about the known quantities (in green):
$F_t$ and $G_t$ are modeling choices, as are the parameters of the prior, $m_0$ and $C_0$.
#### Smoothing
For $t < T$, we have that:
$$
(\theta_t \mid D_T) \sim \mathcal{N}(a_T(t - T), R_T(t - T)),
$$
where
$$
a_T(t - T) = m_t - B_t [a_{t+1} - a_T(t - T + 1)],
$$
$$
R_T(t - T) = C_t - B_t [R_{t+1} - R_T(t - T + 1)] B_t',
$$
for $t = (T - 1), (T - 2), \dots, 0$, with $B_t = C_t G_t' R_{t+1}^{-1}$, and $a_T(0) = m_T$, $R_T(0) = C_T$. Here $a_t$, $m_t$, $R_t$, and $C_t$ are obtained using the filtering equations as explained before.
#### Forecasting
For $h \geq 0$, it is possible to show that:
$$
(\theta_{t+h} \mid D_t) \sim \mathcal{N}(a_t(h), R_t(h)),
$$
$$
(y_{t+h} \mid D_t) \sim \mathcal{N}(f_t(h), q_t(h)),
$$
with
$$
a_t(h) = G_{t+h} a_t(h - 1),
$$
$$
R_t(h) = G_{t+h} R_t(h - 1) G_{t+h}' + W_{t+h},
$$
$$
f_t(h) = F_{t+h}' a_t(h),
$$
$$
q_t(h) = F_{t+h}' R_t(h) F_{t+h} + v_{t+h},
$$
and
$$
a_t(0) = m_t, \quad R_t(0) = C_t.
$$
<!-- end -->
## Smoothing in the NDLM, Example :movie_camera:
This video segment walks through the code in @sec-smoothing-in-the-NDLM below.
## Code: Smoothing in the NDLM, Example :spiral_notepad: $\mathcal{R}$ {#sec-smoothing-in-the-NDLM}
```{r}
#| label: lst-smoothing-in-the-NDLM
#################################################
##### Univariate DLM: Known, constant variances
#################################################
set_up_dlm_matrices <- function(FF, GG, VV, WW){
return(list(FF=FF, GG=GG, VV=VV, WW=WW))
}
set_up_initial_states <- function(m0, C0){
return(list(m0=m0, C0=C0))
}
### forward update equations ###
forward_filter <- function(data, matrices, initial_states){
## retrieve dataset
y_t <- data$y_t
T <- length(y_t)
## retrieve a set of quadruples
# FF, GG, VV, WW are scalar
FF <- matrices$FF
GG <- matrices$GG
VV <- matrices$VV
WW <- matrices$WW
## retrieve initial states
m0 <- initial_states$m0
C0 <- initial_states$C0
## create placeholder for results
d <- dim(GG)[1]
at <- matrix(NA, nrow=T, ncol=d)
Rt <- array(NA, dim=c(d, d, T))
ft <- numeric(T)
Qt <- numeric(T)
mt <- matrix(NA, nrow=T, ncol=d)
Ct <- array(NA, dim=c(d, d, T))
et <- numeric(T)
for(i in 1:T){
# moments of priors at t
if(i == 1){
at[i, ] <- GG %*% t(m0)
Rt[, , i] <- GG %*% C0 %*% t(GG) + WW
Rt[,,i] <- 0.5*Rt[,,i]+0.5*t(Rt[,,i])
}else{
at[i, ] <- GG %*% t(mt[i-1, , drop=FALSE])
Rt[, , i] <- GG %*% Ct[, , i-1] %*% t(GG) + WW
Rt[,,i] <- 0.5*Rt[,,i]+0.5*t(Rt[,,i])
}
# moments of one-step forecast:
ft[i] <- t(FF) %*% (at[i, ])
Qt[i] <- t(FF) %*% Rt[, , i] %*% FF + VV
# moments of posterior at t:
At <- Rt[, , i] %*% FF / Qt[i]
et[i] <- y_t[i] - ft[i]
mt[i, ] <- at[i, ] + t(At) * et[i]
Ct[, , i] <- Rt[, , i] - Qt[i] * At %*% t(At)
Ct[,,i] <- 0.5*Ct[,,i] + 0.5*t(Ct[,,i])
}
cat("Forward filtering is completed!") # indicator of completion
return(list(mt = mt, Ct = Ct, at = at, Rt = Rt,
ft = ft, Qt = Qt))
}
forecast_function <- function(posterior_states, k, matrices){
## retrieve matrices
FF <- matrices$FF
GG <- matrices$GG
WW <- matrices$WW
VV <- matrices$VV
mt <- posterior_states$mt
Ct <- posterior_states$Ct
## set up matrices
T <- dim(mt)[1] # time points
d <- dim(mt)[2] # dimension of state parameter vector
## placeholder for results
at <- matrix(NA, nrow = k, ncol = d)
Rt <- array(NA, dim=c(d, d, k))
ft <- numeric(k)
Qt <- numeric(k)
for(i in 1:k){
## moments of state distribution
if(i == 1){
at[i, ] <- GG %*% t(mt[T, , drop=FALSE])
Rt[, , i] <- GG %*% Ct[, , T] %*% t(GG) + WW
Rt[,,i] <- 0.5*Rt[,,i]+0.5*t(Rt[,,i])
}else{
at[i, ] <- GG %*% t(at[i-1, , drop=FALSE])
Rt[, , i] <- GG %*% Rt[, , i-1] %*% t(GG) + WW
Rt[,,i] <- 0.5*Rt[,,i]+0.5*t(Rt[,,i])
}
## moments of forecast distribution
ft[i] <- t(FF) %*% t(at[i, , drop=FALSE])
Qt[i] <- t(FF) %*% Rt[, , i] %*% FF + VV
}
cat("Forecasting is completed!") # indicator of completion
return(list(at=at, Rt=Rt, ft=ft, Qt=Qt))
}
## obtain 95% credible interval
get_credible_interval <- function(mu, sigma2,
quantile = c(0.025, 0.975)){
z_quantile <- qnorm(quantile)
bound <- matrix(0, nrow=length(mu), ncol=2)
bound[, 1] <- mu + z_quantile[1]*sqrt(as.numeric(sigma2)) # lower bound
bound[, 2] <- mu + z_quantile[2]*sqrt(as.numeric(sigma2)) # upper bound
return(bound)
}
### smoothing equations ###
backward_smoothing <- function(data, matrices,
posterior_states){
## retrieve data
y_t <- data$y_t
T <- length(y_t)
## retrieve matrices
FF <- matrices$FF
GG <- matrices$GG
## retrieve matrices
mt <- posterior_states$mt
Ct <- posterior_states$Ct
at <- posterior_states$at
Rt <- posterior_states$Rt
## create placeholder for posterior moments
mnt <- matrix(NA, nrow = dim(mt)[1], ncol = dim(mt)[2])
Cnt <- array(NA, dim = dim(Ct))
fnt <- numeric(T)
Qnt <- numeric(T)
for(i in T:1){
# moments for the distributions of the state vector given D_T
if(i == T){
mnt[i, ] <- mt[i, ]
Cnt[, , i] <- Ct[, , i]
Cnt[, , i] <- 0.5*Cnt[, , i] + 0.5*t(Cnt[, , i])
}else{
inv_Rtp1<-solve(Rt[,,i+1])
Bt <- Ct[, , i] %*% t(GG) %*% inv_Rtp1
mnt[i, ] <- mt[i, ] + Bt %*% (mnt[i+1, ] - at[i+1, ])
Cnt[, , i] <- Ct[, , i] + Bt %*% (Cnt[, , i + 1] - Rt[, , i+1]) %*% t(Bt)
Cnt[,,i] <- 0.5*Cnt[,,i] + 0.5*t(Cnt[,,i])
}
# moments for the smoothed distribution of the mean response of the series
fnt[i] <- t(FF) %*% t(mnt[i, , drop=FALSE])
Qnt[i] <- t(FF) %*% t(Cnt[, , i]) %*% FF
}
cat("Backward smoothing is completed!")
return(list(mnt = mnt, Cnt = Cnt, fnt=fnt, Qnt=Qnt))
}
####################### Example: Lake Huron Data ######################
plot(LakeHuron,main="Lake Huron Data",ylab="level in feet")
# 98 observations total
k=4
T=length(LakeHuron)-k # We take the first 94 observations
# as our data
ts_data=LakeHuron[1:T]
ts_validation_data <- LakeHuron[(T+1):98]
data <- list(y_t = ts_data)
## set up matrices
FF <- as.matrix(1)
GG <- as.matrix(1)
VV <- as.matrix(1)
WW <- as.matrix(1)
m0 <- as.matrix(570)
C0 <- as.matrix(1e4)
## wrap up all matrices and initial values
matrices <- set_up_dlm_matrices(FF,GG,VV,WW)
initial_states <- set_up_initial_states(m0, C0)
## filtering
results_filtered <- forward_filter(data, matrices,
initial_states)
ci_filtered<-get_credible_interval(results_filtered$mt,
results_filtered$Ct)
## smoothing
results_smoothed <- backward_smoothing(data, matrices,
results_filtered)
ci_smoothed <- get_credible_interval(results_smoothed$mnt,
results_smoothed$Cnt)
index=seq(1875, 1972, length.out = length(LakeHuron))
index_filt=index[1:T]
plot(index, LakeHuron, main = "Lake Huron Level ",type='l',
xlab="time",ylab="level in feet",lty=3,ylim=c(575,583))
points(index,LakeHuron,pch=20)
lines(index_filt, results_filtered$mt, type='l',
col='red',lwd=2)
lines(index_filt, ci_filtered[,1], type='l', col='red',lty=2)
lines(index_filt, ci_filtered[,2], type='l', col='red',lty=2)
lines(index_filt, results_smoothed$mnt, type='l',
col='blue',lwd=2)
lines(index_filt, ci_smoothed[,1], type='l', col='blue',lty=2)
lines(index_filt, ci_smoothed[,2], type='l', col='blue',lty=2)
legend('bottomleft', legend=c("filtered","smoothed"),
col = c("red", "blue"), lty=c(1, 1))
```
## Second order polynomial: Filtering and smoothing example :movie_camera: {#sec-second-order-polynomial}
In this video we walk through the code provided in @sec-dlm-package below, starting at the comment introducing the second order polynomial model for the CO2 data.
::: {.callout-note collapse="true"}
## Video Transcript
{{< include transcripts/_C4-L03-T07.qmd >}}
:::
## Using the `dlm` package in R :movie_camera:
The `dlm` package in R is a powerful tool for working with dynamic linear models. It provides a wide range of functions for filtering, smoothing, forecasting, and parameter estimation in DLMs. In this video, we walk through the code provided in @lst-dlm-package.
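The listing below relies on `dlmFilter` and `dlmSmooth`; for completeness, here is a hedged sketch of $k$-step-ahead forecasting with `dlmForecast`, mirroring the hand-rolled `forecast_function` above and reusing the first order polynomial set-up for the Lake Huron data.

```{r}
#| label: lst-dlm-forecast-sketch
library(dlm)

# Sketch: 4-step-ahead forecast from a filtered first order polynomial model,
# using the same Lake Huron set-up as the listing below (first 94 observations).
model <- dlmModPoly(order = 1, dV = 1, dW = 1, m0 = 570, C0 = 1e4)
filt  <- dlmFilter(LakeHuron[1:94], model)
fc    <- dlmForecast(filt, nAhead = 4)
fc$f           # forecast means     f_t(h)
unlist(fc$Q)   # forecast variances q_t(h)
```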
::: {.callout-note collapse="true"}
## Video Transcript
{{< include transcripts/_C4-L03-T08.qmd >}}
:::
## Code: Using the `dlm` package :spiral_notepad: $\mathcal{R}$ {#sec-dlm-package}
```{r}
#| label: lst-dlm-package
#| lst-label: lst-dlm-package
#| lst-cap: Using the `dlm` package for dynamic linear models
#################################################
##### Univariate DLM: Known, constant variances
#################################################
set_up_dlm_matrices <- function(FF, GG, VV, WW){
return(list(FF=FF, GG=GG, VV=VV, WW=WW))
}
set_up_initial_states <- function(m0, C0){
return(list(m0=m0, C0=C0))
}
### forward update equations ###
forward_filter <- function(data, matrices, initial_states){
## retrieve dataset
y_t <- data$y_t
T <- length(y_t)
## retrieve a set of quadruples
# FF, GG, VV, WW are scalar
FF <- matrices$FF
GG <- matrices$GG
VV <- matrices$VV
WW <- matrices$WW
## retrieve initial states
m0 <- initial_states$m0
C0 <- initial_states$C0
## create placeholder for results
d <- dim(GG)[1]
at <- matrix(NA, nrow=T, ncol=d)
Rt <- array(NA, dim=c(d, d, T))
ft <- numeric(T)
Qt <- numeric(T)
mt <- matrix(NA, nrow=T, ncol=d)
Ct <- array(NA, dim=c(d, d, T))
et <- numeric(T)
for(i in 1:T){
# moments of priors at t
if(i == 1){
at[i, ] <- GG %*% t(m0)
Rt[, , i] <- GG %*% C0 %*% t(GG) + WW
Rt[,,i] <- 0.5*Rt[,,i]+0.5*t(Rt[,,i])
}else{
at[i, ] <- GG %*% t(mt[i-1, , drop=FALSE])
Rt[, , i] <- GG %*% Ct[, , i-1] %*% t(GG) + WW
Rt[,,i] <- 0.5*Rt[,,i]+0.5*t(Rt[,,i])
}
# moments of one-step forecast:
ft[i] <- t(FF) %*% (at[i, ])
Qt[i] <- t(FF) %*% Rt[, , i] %*% FF + VV
# moments of posterior at t:
At <- Rt[, , i] %*% FF / Qt[i]
et[i] <- y_t[i] - ft[i]
mt[i, ] <- at[i, ] + t(At) * et[i]
Ct[, , i] <- Rt[, , i] - Qt[i] * At %*% t(At)
Ct[,,i] <- 0.5*Ct[,,i] + 0.5*t(Ct[,,i])
}
cat("Forward filtering is completed!") # indicator of completion
return(list(mt = mt, Ct = Ct, at = at, Rt = Rt,
ft = ft, Qt = Qt))
}
forecast_function <- function(posterior_states, k, matrices){
## retrieve matrices
FF <- matrices$FF
GG <- matrices$GG
WW <- matrices$WW
VV <- matrices$VV
mt <- posterior_states$mt
Ct <- posterior_states$Ct
## set up matrices
T <- dim(mt)[1] # time points
d <- dim(mt)[2] # dimension of state parameter vector
## placeholder for results
at <- matrix(NA, nrow = k, ncol = d)
Rt <- array(NA, dim=c(d, d, k))
ft <- numeric(k)
Qt <- numeric(k)
for(i in 1:k){
## moments of state distribution
if(i == 1){
at[i, ] <- GG %*% t(mt[T, , drop=FALSE])
Rt[, , i] <- GG %*% Ct[, , T] %*% t(GG) + WW
Rt[,,i] <- 0.5*Rt[,,i]+0.5*t(Rt[,,i])
}else{
at[i, ] <- GG %*% t(at[i-1, , drop=FALSE])
Rt[, , i] <- GG %*% Rt[, , i-1] %*% t(GG) + WW
Rt[,,i] <- 0.5*Rt[,,i]+0.5*t(Rt[,,i])
}
## moments of forecast distribution
ft[i] <- t(FF) %*% t(at[i, , drop=FALSE])
Qt[i] <- t(FF) %*% Rt[, , i] %*% FF + VV
}
cat("Forecasting is completed!") # indicator of completion
return(list(at=at, Rt=Rt, ft=ft, Qt=Qt))
}
## obtain 95% credible interval
get_credible_interval <- function(mu, sigma2,
quantile = c(0.025, 0.975)){
z_quantile <- qnorm(quantile)
bound <- matrix(0, nrow=length(mu), ncol=2)
bound[, 1] <- mu + z_quantile[1]*sqrt(as.numeric(sigma2)) # lower bound
bound[, 2] <- mu + z_quantile[2]*sqrt(as.numeric(sigma2)) # upper bound
return(bound)
}
### smoothing equations ###
backward_smoothing <- function(data, matrices,
posterior_states){
## retrieve data
y_t <- data$y_t
T <- length(y_t)
## retrieve matrices
FF <- matrices$FF
GG <- matrices$GG
## retrieve matrices
mt <- posterior_states$mt
Ct <- posterior_states$Ct
at <- posterior_states$at
Rt <- posterior_states$Rt
## create placeholder for posterior moments
mnt <- matrix(NA, nrow = dim(mt)[1], ncol = dim(mt)[2])
Cnt <- array(NA, dim = dim(Ct))
fnt <- numeric(T)
Qnt <- numeric(T)
for(i in T:1){
# moments for the distributions of the state vector given D_T
if(i == T){
mnt[i, ] <- mt[i, ]
Cnt[, , i] <- Ct[, , i]
Cnt[, , i] <- 0.5*Cnt[, , i] + 0.5*t(Cnt[, , i])
}else{
inv_Rtp1<-solve(Rt[,,i+1])
Bt <- Ct[, , i] %*% t(GG) %*% inv_Rtp1
mnt[i, ] <- mt[i, ] + Bt %*% (mnt[i+1, ] - at[i+1, ])
Cnt[, , i] <- Ct[, , i] + Bt %*% (Cnt[, , i + 1] - Rt[, , i+1]) %*% t(Bt)
Cnt[,,i] <- 0.5*Cnt[,,i] + 0.5*t(Cnt[,,i])
}
# moments for the smoothed distribution of the mean response of the series
fnt[i] <- t(FF) %*% t(mnt[i, , drop=FALSE])
Qnt[i] <- t(FF) %*% t(Cnt[, , i]) %*% FF
}
cat("Backward smoothing is completed!")
return(list(mnt = mnt, Cnt = Cnt, fnt=fnt, Qnt=Qnt))
}
####################### Example: Lake Huron Data ######################
plot(LakeHuron) # 98 observations total
k=4
T=length(LakeHuron)-k # We take the first
# 94 observations only as our data
ts_data=LakeHuron[1:T]
ts_validation_data <- LakeHuron[(T+1):98]
data <- list(y_t = ts_data)
## set up dlm matrices
GG <- as.matrix(1)
FF <- as.matrix(1)
VV <- as.matrix(1)
WW <- as.matrix(1)
m0 <- as.matrix(570)
C0 <- as.matrix(1e4)
## wrap up all matrices and initial values
matrices <- set_up_dlm_matrices(FF, GG, VV, WW)
initial_states <- set_up_initial_states(m0, C0)
## filtering and smoothing
results_filtered <- forward_filter(data, matrices,
initial_states)
results_smoothed <- backward_smoothing(data, matrices,
results_filtered)
index=seq(1875, 1972, length.out = length(LakeHuron))
index_filt=index[1:T]
par(mfrow=c(2,1), mar = c(3, 4, 2, 1))
plot(index, LakeHuron, main = "Lake Huron Level ",type='l',
xlab="time",ylab="feet",lty=3,ylim=c(575,583))
points(index,LakeHuron,pch=20)
lines(index_filt, results_filtered$mt, type='l',
col='red',lwd=2)
lines(index_filt, results_smoothed$mnt, type='l',
col='blue',lwd=2)
# Now let's look at the DLM package
library(dlm)
model=dlmModPoly(order=1,dV=1,dW=1,m0=570,C0=1e4)
results_filtered_dlm=dlmFilter(LakeHuron[1:T],model)
results_smoothed_dlm=dlmSmooth(results_filtered_dlm)
plot(index_filt, LakeHuron[1:T], ylab = "level",
main = "Lake Huron Level",
type='l', xlab="time",lty=3,ylim=c(575,583))
points(index_filt,LakeHuron[1:T],pch=20)
lines(index_filt,results_filtered_dlm$m[-1],col='red',lwd=2)
lines(index_filt,results_smoothed_dlm$s[-1],col='blue',lwd=2)
# Similarly, for the second order polynomial and the co2 data:
T=length(co2)
data=list(y_t = co2)
FF <- (as.matrix(c(1,0)))
GG <- matrix(c(1,1,0,1),ncol=2,byrow=T)
VV <- as.matrix(200)
WW <- 0.01*diag(2)
m0 <- t(as.matrix(c(320,0)))
C0 <- 10*diag(2)
## wrap up all matrices and initial values
matrices <- set_up_dlm_matrices(FF,GG, VV, WW)
initial_states <- set_up_initial_states(m0, C0)
## filtering and smoothing
results_filtered <- forward_filter(data, matrices,
initial_states)
results_smoothed <- backward_smoothing(data, matrices,
results_filtered)
#### Now, using the DLM package:
model=dlmModPoly(order=2,dV=200,dW=0.01*rep(1,2),
m0=c(320,0),C0=10*diag(2))
# filtering and smoothing
results_filtered_dlm=dlmFilter(data$y_t,model)
results_smoothed_dlm=dlmSmooth(results_filtered_dlm)
par(mfrow=c(2,1), mar = c(3, 4, 2, 1))
plot(as.vector(time(co2)),co2,type='l',xlab="time",
ylim=c(300,380))
lines(as.vector(time(co2)),results_filtered$mt[,1],
col='red',lwd=2)
lines(as.vector(time(co2)),results_smoothed$mnt[,1],
col='blue',lwd=2)
plot(as.vector(time(co2)),co2,type='l',xlab="time",
ylim=c(300,380))
lines(as.vector(time(co2)),results_filtered_dlm$m[-1,1],
col='red',lwd=2)
lines(as.vector(time(co2)),results_smoothed_dlm$s[-1,1],
col='blue',lwd=2)
```