100  Bayesian Inference in the NDLM: Part 2

Time Series Analysis

Normal Dynamic Linear Models (NDLMs) are a class of models used for time series analysis that allow for flexible modeling of temporal dependencies.
Bayesian Statistics
Time Series
Author

Oren Bochman

Published

November 9, 2024

Keywords

Time series, Filtering, Smoothing, NDLM, Normal Dynamic Linear Models, Seasonal NDLM, Superposition Principle, R code

100.1 Filtering, Smoothing and Forecasting: Unknown observational variance 🎥

  • This material seems unnecessarily complicated:
  1. The section reads like a lot of mathematical machinery presented without context or motivation for each of the equations.
  • After writing everything out and reviewing it multiple times, I began to develop a purely mathematical intuition, i.e. I can relate the equations to the general Bayesian inference framework.
  • I can’t, however, say I am much closer to using them in an engineering solution based on this mathematics.
  • I doubt that doing a single capstone project will make much of a difference.
  2. We will see three applications in this module (Nile river data, Google Trends, and EEG data). What I still need to understand, for each dataset and challenge, is why the filtering, smoothing and forecasting distributions are more useful than the original ones.

  3. Perhaps the root cause is that this course fails to acknowledge the connection between the NDLM and Kalman filters, and the much wider literature on Kalman filtering. The first course of the Kalman filtering specialization (the KF bootcamp) presents a number of applications, each with different requirements; mapping those challenges onto the NDLM framework should make the mathematics easier to navigate.

  4. In any case, a quick lookup indicates that in state-space terminology this model is known as the local level model, used for local-level or trend filtering (e.g. in structural time series models), and in Kalman filtering terminology as a Kalman filter with variance learning.

We now consider the case of the NDLM in which the observational variance is constant over time, i.e. v_t = v for all t, with v unknown.

100.1.1 Inference in the NDLM with unknown constant observational variance

Let v_t = v\ \forall t with v unknown, and consider a DLM with the following structure: \begin{aligned} y_t &= \mathbf{F}_t^\top \boldsymbol{\theta}_t + \nu_t, &\nu_t &\sim \mathcal{N} (0, v) &\text{(obs)}\\ \boldsymbol{\theta}_t &= \mathbf{G}_t \boldsymbol{\theta}_{t-1} + \boldsymbol{\omega}_t, & \boldsymbol{\omega}_t &\sim \mathcal{N} (0, v \mathbf{W}^*_t) &\text{(sys)} \end{aligned} \tag{100.1}

  • where we have the usual suspects:
    • an observation equation, relating the observed y_t to the hidden state \boldsymbol{\theta}_t plus some observation (sensor) noise \nu_t;
    • a state equation, describing how the hidden state \boldsymbol{\theta}_t evolves, plus some process noise \boldsymbol{\omega}_t;
    • \mathbf{F}_t is a known observation operator;
    • \mathbf{G}_t is a known state transition operator;
    • \boldsymbol{\theta}_t is the hidden state vector;
    • \boldsymbol{\omega}_t is the process noise, e.g. friction or wind; and
    • \mathbf{W}^*_t is a known, scale-free covariance matrix for the process noise, so the full process-noise covariance is v\,\mathbf{W}^*_t.
Important — Question: why does v appear in the distribution of the system noise \boldsymbol{\omega}_t?

Prado notes that we are also assuming the system noise is scaled by the observational variance v. This assumption deserves justification: scaling \mathbf{W}^*_t by v is precisely what makes the normal/inverse-gamma prior below conjugate, so the filtering and smoothing distributions remain available in closed form.

What is still missing, in my view, is a concrete motivating use case for this model.

with conjugate prior distributions: \begin{aligned} (\boldsymbol{\theta}_0 \mid \mathcal{D}_0, v) &\sim \mathcal{N} (\mathbf{m}_0, v\ \mathbf{C}^*_0) \\ (v \mid \mathcal{D}_0) &\sim \mathcal{IG}\left(\frac{n_0}{2}, \frac{d_0}{2}\right ) \\ d_0 &= n_0 s_0 \end{aligned} \tag{100.2}

Here n_0 is a degrees-of-freedom (prior sample size) parameter and s_0 is a prior point estimate of the variance v.
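To make the model concrete, here is a minimal simulation sketch (my own code, not part of the course material) of the simplest instance of Equation 100.1: a first-order polynomial (local level) model with \mathbf{F}_t = \mathbf{G}_t = 1. The latent level \theta_t drifts as a random walk while y_t observes it with noise of variance v; the values chosen for v and W^* are arbitrary and only for illustration.

## Minimal simulation sketch of Equation 100.1 with F_t = G_t = 1 (my own code)
set.seed(42)
n_sim  <- 200
v      <- 1.5        # observational variance (unknown in practice)
W_star <- 0.05       # scale-free system variance, so Var(omega_t) = v * W_star
theta  <- numeric(n_sim)
y      <- numeric(n_sim)
theta[1] <- 25
y[1]     <- theta[1] + rnorm(1, 0, sqrt(v))
for (t in 2:n_sim) {
  theta[t] <- theta[t - 1] + rnorm(1, 0, sqrt(v * W_star))  # system equation
  y[t]     <- theta[t] + rnorm(1, 0, sqrt(v))               # observation equation
}
plot(y, type='l', col='grey40', xlab='t', ylab='y_t')
lines(theta, col='red', lwd=2)   # latent level we will try to recover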

100.1.2 Filtering

We now consider filtering, i.e. sequentially updating our uncertainty as observations arrive: it lets us infer the state at the current time step, and predict the current observation, given the information from the previous time step.

We will be interested in the following distributions:

  • p(\boldsymbol{\theta}_t \mid \mathcal{D}_{t−1}) the prior density for the state vector at time t given information up to the preceding time;
  • p(y_t \mid \mathcal{D}_{t−1}) the one-step-ahead predictive density for the next observation;
  • p(\boldsymbol{\theta}_t \mid \mathcal{D}_t) the posterior density for the state vector at time t given \mathcal{D}_{t−1} and y_t;
  • the h-step-ahead forecasts p(y_{t+h} \mid \mathcal{D}_t) and p(\boldsymbol{\theta}_{t+h} \mid \mathcal{D}_t);
  • p(\boldsymbol{\theta}_t \mid \mathcal{D}_T) the smoothing density for \boldsymbol{\theta}_t where T > t.

Assuming:

(\boldsymbol{\theta}_{t-1} \mid \mathcal{D}_{t-1}, v) \sim \mathcal{N} (\mathbf{m}_{t-1}, v\ \mathbf{C}^*_{t-1})

By filtering we obtain the distribution of the state at time t, conditional on the previous time step’s information and the variance v, which is normal:

\begin{aligned} (\boldsymbol{\theta}_t \mid \mathcal{D}_{t-1}, v) &\sim \mathcal{N} (\mathbf{a}_t, v \mathbf{R}^*_t) \\ \mathbf{a}_t &= \mathbf{G}_t \mathbf{m}_{t-1} \\ \mathbf{R}^*_t &= \mathbf{G}_t \mathbf{C}^*_{t-1} \mathbf{G}'_t + \mathbf{W}^*_t \end{aligned} \tag{100.3}

We can also marginalize v by integrating it out, which gives the distribution unconditional on v:

\begin{aligned} (\boldsymbol{\theta}_t \mid \mathcal{D}_{t-1}) &\sim T_{n_{t-1}} (\mathbf{a}_t, \mathbf{R}_t) \\ \mathbf{R}_t &\equiv s_{t-1} \mathbf{R}^*_t \end{aligned} \tag{100.4}

with s_t\ \forall t given in Equation 100.5.

  • (y_t \mid \mathcal{D}_{t-1}, v) \sim \mathcal{N} (f_t, v q^*_t)
    • with
      • f_t = \mathbf{F}'_t \mathbf{a}_t
      • q^*_t = 1 + \mathbf{F}'_t \mathbf{R}^*_t \mathbf{F}_t

and unconditional on v we have

  • (y_t \mid \mathcal{D}_{t-1}) \sim T_{n_{t-1}} (f_t, q_t)

    • with
      • q_t = s_{t-1} q^*_t
  • (v \mid \mathcal{D}_t) \sim \mathcal{IG}\left( \tfrac{n_t}{2}, \tfrac{d_t}{2}\right )

    • with d_t = n_t s_t and
      • n_t = n_{t-1} + 1, \qquad s_t = s_{t-1} + \frac{s_{t-1}}{n_t} \left ( \frac{e^2_t}{q^*_t} - 1 \right ), \tag{100.5}

    where e_t = y_t - f_t is the one-step-ahead forecast error.

  • (\boldsymbol{\theta}_t \mid \mathcal{D}_t, v) \sim \mathcal{N} (\mathbf{m}_t, v \mathbf{C}^*_t),

    • with
      • \mathbf{m}_t = \mathbf{a}_t + \mathbf{A}_t e_t, and
      • \mathbf{C}^*_t = \mathbf{R}^*_t - \mathbf{A}_t \mathbf{A}'_t q^*_t,
      • where \mathbf{A}_t = \mathbf{R}^*_t \mathbf{F}_t / q^*_t is the adaptive coefficient (Kalman gain) vector.

Similarly, unconditional on v we have

  • (\boldsymbol{\theta}_t\mid \mathcal{D}_t) \sim T_{n_t} (\mathbf{m}_t, \mathbf{C}_t)
    • with
      • \mathbf{C}_t = s_t \mathbf{C}^*_t
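The filtering recursions above translate almost line by line into code. The following univariate sketch (my own code, not the course’s forward_filter_unknown_v, which handles the general matrix case) performs a single update following Equations 100.3–100.5 and the posterior moments above.

## One filtering update for the unknown-v NDLM, univariate case (my own sketch)
filter_step <- function(y, m, Cstar, n_df, s, F_ = 1, G = 1, W_star = 0.05) {
  a     <- G * m                       # prior mean      a_t = G_t m_{t-1}
  Rstar <- G * Cstar * G + W_star      # prior scale     R*_t
  f     <- F_ * a                      # forecast mean   f_t
  qstar <- 1 + F_ * Rstar * F_         # forecast scale  q*_t
  A     <- Rstar * F_ / qstar          # adaptive coefficient A_t
  e     <- y - f                       # forecast error  e_t
  n_new <- n_df + 1                    # updated degrees of freedom n_t
  s_new <- s + (s / n_new) * (e^2 / qstar - 1)  # updated estimate s_t of v
  m_new     <- a + A * e               # posterior mean  m_t
  Cstar_new <- Rstar - A^2 * qstar     # posterior scale C*_t
  list(m = m_new, Cstar = Cstar_new, n = n_new, s = s_new,
       f = f, q = s * qstar)           # q_t = s_{t-1} q*_t (unconditional scale)
}

## run it sequentially over the simulated series y from the earlier sketch
state <- list(m = 20, Cstar = 10, n = 1, s = 1)
for (t in seq_along(y)) {
  state <- filter_step(y[t], state$m, state$Cstar, state$n, state$s)
}
state$s   # sequential estimate of the observational variance v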

100.1.3 Forecasting

Similarly, we have the forecasting distributions:

  • (\boldsymbol{\theta}_{t+h} \mid \mathcal{D}_t) \sim T_{n_t} (\mathbf{a}_{t}(h), \mathbf{R}_{t}(h)),

  • (y_{t+h} \mid \mathcal{D}_t) \sim T_{n_t} (f_{t}(h), q_{t}(h)),

  • with

    • \mathbf{a}_{t}(h) = \mathbf{G}_{t+h} \mathbf{a}_{t}(h-1),
    • \mathbf{a}_{t}(0) = \mathbf{m}_t, and

\begin{aligned} \mathbf{R}_{t}(h) &= \mathbf{G}_{t+h} \mathbf{R}_{t}(h-1) \mathbf{G}'_{t+h} + \mathbf{W}_{t+h} \\ \mathbf{R}_{t}(0) &= \mathbf{C}_t \end{aligned}

f_{t}(h) = \mathbf{F}'_{t+h} \mathbf{a}_{t}(h) \qquad q_{t}(h) = \mathbf{F}'_{t+h} \mathbf{R}_{t}(h) \mathbf{F}_{t+h} + s_{t}
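As a companion to these recursions, here is a univariate sketch (my own code) that computes the forecast moments f_t(h) and q_t(h) for h = 1, \dots, k with constant F, G and W; the state scale passed in is \mathbf{C}_t = s_t \mathbf{C}^*_t and W is taken on that same scale.

## h-step-ahead forecast moments, univariate sketch (my own code)
forecast_moments <- function(m, C, s, h, F_ = 1, G = 1, W = 0.05) {
  a <- m; R <- C
  out <- data.frame(h = 1:h, f = NA_real_, q = NA_real_)
  for (j in 1:h) {
    a <- G * a                     # a_t(j) = G_{t+j} a_t(j-1),  a_t(0) = m_t
    R <- G * R * G + W             # R_t(j) = G_{t+j} R_t(j-1) G'_{t+j} + W_{t+j}
    out$f[j] <- F_ * a             # f_t(j) = F'_{t+j} a_t(j)
    out$q[j] <- F_ * R * F_ + s    # q_t(j) = F'_{t+j} R_t(j) F_{t+j} + s_t
  }
  out   # (y_{t+j} | D_t) ~ T_{n_t}(f_t(j), q_t(j))
}

## continuing from the filtering sketch above: forecast five steps ahead
forecast_moments(m = state$m, C = state$s * state$Cstar,
                 s = state$s, h = 5, W = state$s * 0.05)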

100.1.4 Smoothing

Finally, the smoothing distributions have the form:

(\boldsymbol{\theta}_t \mid \mathcal{D}_T) \sim T_{n_T} \left(\mathbf{a}_T (t - T),\ \mathbf{R}_T (t - T)\, s_T / s_t \right)

  • where \mathbf{a}_T (t - T) and \mathbf{R}_T (t - T) are computed via the backward recursions

\mathbf{a}_T (t - T) = \mathbf{m}_t - \mathbf{B}_t [\mathbf{a}_{t+1} - \mathbf{a}_T (t - T + 1)]

\mathbf{R}_T (t - T) = \mathbf{C}_t - \mathbf{B}_t [\mathbf{R}_{t+1} - \mathbf{R}_T (t - T + 1)] \mathbf{B}'_t

with

\mathbf{B}_t = \mathbf{C}_t \mathbf{G}'_{t+1} \mathbf{R}^{-1}_{t+1}, \quad \mathbf{a}_T (0) = \mathbf{m}_T , \quad \mathbf{R}_T (0) = \mathbf{C}_T.
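A univariate sketch of these backward recursions (my own code; the course’s backward_smoothing_unknown_v handles the general matrix case and, with unknown v, additionally rescales by s_T / s_t). It assumes the vectors m, C, a, R were stored during filtering.

## Backward smoothing sketch, univariate case (my own code)
## m, C, a, R: vectors of filtering moments for t = 1, ..., T
smooth_backward <- function(m, C, a, R, G = 1) {
  T_ <- length(m)
  mn <- m; Cn <- C                  # at t = T: a_T(0) = m_T, R_T(0) = C_T
  for (t in (T_ - 1):1) {
    B     <- C[t] * G / R[t + 1]    # B_t = C_t G'_{t+1} R^{-1}_{t+1}
    mn[t] <- m[t] - B * (a[t + 1] - mn[t + 1])     # a_T(t - T)
    Cn[t] <- C[t] - B^2 * (R[t + 1] - Cn[t + 1])   # R_T(t - T)
  }
  list(mnt = mn, Cnt = Cn)          # smoothed means and scales
}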

We will now discuss the filtering, and smoothing, and forecasting distributions. In the case of the normal dynamic linear model, where we assume that the observational variance is constant over time. So we’re going to call that variance v and so vt is equal to v for all t, but v is unknown. So we’re going to place a prior distribution on that variance. We are going to write down the model in terms of again, an observation equation and a system equation. The observation equation has the same form that we discussed before. We’re now the observational variance vt is changed to be always v, but now v is unknown. And then the system equation we are going to write down as an equation that is conditional on the variance v. So we have that say theta t is equal to Gt theta t minus one plus the system noise. And we are assuming that this is normally distributed with mean zero and now the variance co-variance matrix is v times Wt star. So here we assume then that Wt star is a known quantity for all the times t from one and so on. And then we’re also assuming that Ft and Gt are given. So what kind of prior distribution can we use now? The structure of the prior in this case that is conjugated with the model that we are proposing is a normal inverse gamma prior that means for theta zero given the zero we condition on v. We’re going to use a normal prior distribution with mean m0 and variance co-variance matrix v times C0 star. And then the distribution for v given D0 is an inverse gamma with parameters n0 half and d0 half. d0 can be written in terms of n0 times s0, so here n0 is the degrees of freedom parameter. So the higher the value of the degrees of freedom is the more information you are inputting in this prior. And then s0 is a prior estimate of the variance v before you obtain any observations, so this is how you can view the prior. So this is a conditional prior structure, if you use this prior structure, you can obtain the filtering equations in closed form. The smoothing equations also in closed form. So we are going to discuss those equations now, so again, we are assuming that at time t- 1. We have this conditional structure in which the prior the posterior at time t -1, given the t -1, and v has this normal structure conditional on v. And then we given the t minus one, has this inverse gamma structure. Now, my parameters are indexed it in time, do I have nt- 1/2, and dt- 1/2, and the dt- 1 has the same structure, nt- 1 times St- 1. So, if I have this condition structure, we can show, and this structure, as we said, is true for t = 0, as we wrote down before. We can show that the prior distribution at time t given Dt- 1 and conditional on v is a normal at vRt star. And now I have equations just to compute the value of the mean at is Gt * mt- 1. So this is exactly the same equation we had before when we were discussing the case of the DLM with known variances at both the observational and system level. And now we have thi Rt star that has the structure Gt Ct- 1, Gt transpose plus Wt star. And then we can obtained as well the one step ahead forecast distribution for yt given Dt- 1, conditional on v is given by this equation here. It’s a normal Ft vqt star, where Ft is Ft transports t, and qt star is Ft transpose, t, Rt star Ft + 1. So we have again, conditional structures on v and then we can obtain the posterior distribution for the variance after I get my first observation, yt. And this is going to be given in terms of an inverse gamma nt over to dt Over two. 
And the degrees of freedom parameter is updated in such a way that it was the number of degrees of freedom you had at step t- 1 + 1. This plus one has to do with the observation you just collected. And then you can compute the dt as nt * St where St is given by this equation here, so it depends on St- 1. It depends on the number of degrees of freedom, it depends on et which is the given by yt- ft, so that’s the prediction error. So yt is the new observation you just obtained and then qt is st- 1qt star. Finally, you can then obtain the posterior distribution for thetat given Dt conditional on v that’s a normal mt vCt star. And you can obtain the equations for the moments of that normal distribution using this equation for the mean. And then we have Ct star this is the equation that allows us to update the co-variance matrix. So this is the distributions we have conditional on v for theta t and yt. We can also write down, we can integrate out the variance here. And when we integrate out the variance, given that the variance has an inverse gamma distribution, we obtain for theta t now unconditional on v. So it only depends on the information up to t- 1, that’s a student t distribution with nt- 1 degrees of freedom and at and Rt are. The location and scale parameters for that distribution, so Rt is St- 1Rt star. Similarly, we get students distributions for the one step ahead prediction distribution for yt given Dt- 1 and for theta t given Dt the posterior for theta t at time, t is a student T nt mt Ct. So we can just write everything conditional on v as we did before. Or if we want to just look at the marginal distributions and conditional on v, we obtained student t distributions. We now discuss the smoothing equations in the case of the unknown variance at the observational level. So we have the same type of notation we’re assuming here we’re looking at the distribution of theta t given D capital T. So capital T here is greater or equal than the lower case t. And before we have a normal distribution now we’re going to have a student T distribution. The number of degrees of freedom is going to be n, capital T. And then we’re going to have mean a capital T, lowercase t- capital T. And there is going to be a recursion for the means of this distribution. And then we have for the scale parameter here we have R capital T(t- T). And then that’s multiplied by s capital T, divided by s lower case t. So in this case we have if you recall this is going to be the estimate that we’re going to get for the variance after receiving capital T observations. While this one is the estimate for the observational variance that we get from the model after receiving lower case t observations. So the recursions are going to work in the same way. I’m not writing down the equations, they are just exactly the equations that we had before. And then for forecasting if I want to look at the distribution of theta t + h. So h here is greater equal than zero given the information up to Dt, that one is going to be a student T with nt degrees of freedom. And then I have my recursions also for at(h) and Rt(h) as we had before. So as in the case with the known observational variance, we are going to need the Gt + h here for all the values of h to compute these moments and also the Wt + h for all the values of h. So we need to specify those matrices in there. Everything else stays the same, and now we have that for the predictive distribution for h steps ahead for yt. 
Given Dt is also now a student T with nt degrees of freedom, we have ft(h) and qt(h). The ft(h) is as before, the qt(h) is slightly different now because we don’t have vt + h. We’re assuming that v is constant and unknown. So what goes in here, you can see is the estimate that we have for that variance v at time t. So this allows us to compute the forecasting distributions.

100.2 Summary of Filtering, Smoothing and Forecasting Distributions, NDLM unknown observational variance 🗒️

100.3 Specifying the system covariance matrix via discount factors 🎥

We have discussed different cases in terms of the specification of the vt, the observational variance. Now, I will talk about how to specify the system variance, assuming that that system variance is not known just to recall the cases we have discussed, suppose that we have a DLM. That looks like this, so we have talked about the case in which vt and Wt, the observation and system variances are known for all the values of t. And we said that we can perform filtering and smoothing in that case using a normal prior distribution. We get results that are for the filtering distributions, we also get normal distributions for the smoothing distribution, we get normal distributions as well. Then we also discussed the case in which, so I guess, let me just clarify here, so we’ve done vt and Wt known for all t. Okay, we also said discussed the case in which we assume that vt is v, so it’s constant over time v known and sorry, v unknown and Wt known for all t. Again, we can use a conjugated structure for v unknown, we impose a normal inverse gamma prior structure. And then we can obtain closed form filtering equations and smoothing equations in that particular case. But we’re assuming here that all the Wts, the system covariance matrices are unknown. Now we are going to discuss the case in which the Wt is unknown, so how do we deal with this? So I’ll begin with assuming that vt is known for all t and we will discuss then what happens if the vt is unknown and now I want to specify my Wt. So if you recall when we were working with the equations of the normal dynamic linear model. There was the variance of the state vector theta_t that the given the information we have up to t- 1. And if you remember the equations, we had two components in here, we had called that Rt and that Rt has Gt, Ct- 1, Gt transpose plus Wt. I’m going to call this first component here in this nation, I’m going to call it Pt and I can think of this Pt as the variance, the prior variance of theta_t given Dt- 1. If I were to consider a model in which this Wt is just zero, in other words, a model in which there is no additional stochastic component in each time at the system level. So in this case, if I call Pt ,Gt, Ct- 1, Gt transpose corresponds to the prior variance in a model of the form Ft, Gt, vt and then I have a zero here. So there is no so fast, there are no stochastic changes which is equivalent to saying that there are no changes over time in the parameters of the model. In the state, parameters of the model. A way to specify then my WT here, which is what I want to do in this particular case is to assume that my Rt, my prior variance in the general model. The variance of theta_t given Dt-1 is this Pt component but I’m going to increase the uncertainty when I think about that variance. So I’m going to say that my Rt is just going to be Pt divided by delta with delta. some factor here that I’m going to call discount factor that is between zero and one. And I’m including the one here when delta is exactly one, what I’m doing is I’m saying there is no stochastic component. I’m going to be having essentially my variance for theta_t given Dt-1 corresponds to the variance of model that looks like this. And in that case, the parameters of the model are not changing over time, I don’t have any changes over time, every, all the parameters stay the same over time. That’s the case in which delta is equal to one and if delta is smaller than one, what I’m doing here by dividing by this delta factor is. 
I am increasing the uncertainty with respect to whatever I had in t- 1, so the closer delta is to zero the more uncertainty I’m going to have. And therefore the more variation I’m going to have in the parameters of the model, the closer delta is to one, the less variation I’m going to have in the model parameters. So if I were to now combine this expression with also this expression, right, what I get, so if I say Rt is Pt plus Wt but it’s also Pt divided by delta. What I get here is that Wt is fully specified as 1 minus delta divided by delta Pt, so I have an expression now for Wt that is given in terms of this discount factor and my Pt. If you look at the expression for Pt, Pt always depends on Gt Ct-1 and Gt transpose here, so you only need the Gt and Ct-1. The first time you compute that is when you consider P1, which is just, you’re going to use your G1, your C0, which is your prior variance for the state vector. And that one you provide and this is something that the modeler provides, the user provides and then you have the G. So once you have this P1, you plug it in here and you’re going to have the W1 and you can continue going through this process, you continue with your filtering equations. All you need is this C0 and the discount factor to fully specify the Wt, in the case, so this is again for the case in which Vt is known for all time t, for all the times t. And you are specifying wt in this way, you can also consider the case in which the W, the vt is constant and unknown. So there is the case in which vt is v with v unknown and the structure in this case for the model is going to be very similar, you can just specify the discount factor in the same way. Now, instead of having Wt, if you go back to the equations that we discussed, you’re going to have a Wtstar. And the Wt star, you can write down in terms of a P star, where the P star is going to be Gt, Ct minus one star Gt. So again, everything comes here with a star, that is just writing that model in terms of the conjugate structure as we described before. So, here, you can fully specify Wt star as one minus delta, divided by delta and then you have that Pt star with and Pt star is going to be given by Gt, Ct- 1 star, Gt transpose. And again, this provides the structure for the Wt star, now, you write down your model, your equations conditional on v. You have the normal distributions for the state vector and then you also impose a prior distribution of v that is an inverse comma. And then go through your equations again for the filtering and smoothing and as we described before. So once again, all you need here is you need your CO star and your discount factor and you have your specification for the Wt star. So how do we choose that discount factor in practice, we have delta in (0,1], so that’s what the theory says and I have to pick a particular value in practice. How do we proceed if I have a model and what is the ideal discount factor value that I can use to specify that Wt matrix. Again, in practice, the first thing that happens is we usually consider discount factor values that are above .8. So usually we just work, it’s not always the case but usually we consider discount factor values that are above .8. So you can think of just considering a grid of values in the interval .8 to 1, including the one and then find a method that allows you to pick that value optimally. You could use different kinds of properties of the model to do this. 
In particular, you can use the one step ahead predictive distributions to give you an idea of what is the best discount factor value you can take. So you can write down if you think about the predictive distributions, you can write down p. So you have all your data here and then you have the information at, time zero and then you have specified a particular DLM with an Ft, Gt and so on. And then when you write this down, you can write this down as a product of those one step ahead densities here. So if you take the log and this, let me just write down that this is also going to depend on whatever discount factor you have for that particular model specification. So if I take the log of this and you can write this p here inside as a product of the individual wise. And so when you take the log, you’re going to have the sum of the logs from one or 2 capital T of the log of these p, yt, Dt minus one. And then this also depends on delta, so this is the one step ahead predictive distribution. So you have those densities and it’s going to be a function of delta in the case of a model in which you know, vt and Wt for all the values of t. You know that those are normal in the case here, we’re specifying Wt using a discount factor, so I’m going to have a normal density if I have vt known. So I can look at this as a function of, let’s call it l of delta as a function of delta and I can view this as a likelihood function for delta, so if I evaluate this in a grid of values. I can pick the so called optimal delta by maximizing and depending on whether or not you’re using a model with the vt known or a model in which vt is v. But v is unknown and you have the conjugate prior structure, you’re going to have normal distributions in the first case here and students distributions in the second case. But you can compute this for each value of delta, if you have a grid of values, you just pick the optimal delta to be this, the one that maximizes this function. This is not the only function you could use, you could also consider other alternatives, you could consider, so this is one option. You have another option in which you can write down a mean square error again as a function of the discount factor for each model that you specify. And you can write down this as, over T, so here, I’m writing et as you remember, it’s just that whatever you observe minus the prediction you get from your model is usually ft. But that prediction depends on delta, so that’s why I’m writing down et as a function of this delta. So here you could pick the optimal discount factor by minimizing this mean square error, so here the optimal delta is chosen by minimizing this function of delta. You could also use other measurements, like the mean absolute deviation and then again minimize this. There are other approaches that could be using practice, you could consider that delta is another parameter in your model. You could put a prior on the delta and you can try to get inference on that value and then again, another thing you could do is if you have multiple components in your model. You could use a discount factor that is dependent on the components, so you could have as many discount factors as components in your model. And use the same idea to specify the W matrix in this block diagonal form, if you were to do that instead of having a function that is just, is univariant on delta. You will have multiple deltas and you will have to pick the optimal values by maximizing that function. 
Or in the case of the MSE, you will have to pick the optimal values of those deltas by minimizing these functions, so there is multiple alternatives. In this course, we are going to be working only with a single discount factor and we can pick the optimal either using this method or this other method that I just described.
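The following sketch (my own code, for the simplest case of a first-order polynomial DLM with known observational variance v) makes the discount idea concrete: for each \delta on a grid the filter uses R_t = P_t / \delta, i.e. W_t = \frac{1-\delta}{\delta} P_t, and we pick the \delta that minimizes the one-step-ahead mean squared error.

## Discount-factor selection by one-step-ahead MSE, first-order polynomial DLM
## with known v (my own sketch; the course provides more general functions)
mse_for_delta <- function(y, delta, v = 1, m0 = 0, C0 = 10) {
  m <- m0; C <- C0
  e <- numeric(length(y))
  for (t in seq_along(y)) {
    a <- m                    # G_t = 1
    R <- C / delta            # R_t = P_t / delta, i.e. W_t = (1 - delta)/delta * P_t
    f <- a                    # F_t = 1
    Q <- R + v                # one-step-ahead forecast variance
    A <- R / Q                # adaptive coefficient
    e[t] <- y[t] - f
    m <- a + A * e[t]
    C <- R - A^2 * Q
  }
  mean(e^2)
}

deltas <- seq(0.80, 1.00, by = 0.005)
mse    <- sapply(deltas, function(d) mse_for_delta(y, d))  # y: any univariate series
deltas[which.min(mse)]   # 'optimal' delta under the MSE criterion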

100.4 NDLM, Unknown Observational Variance: Example 🎥

This is a walkthrough of the R code listed in Section 100.5 below.

We now discuss and illustrate some R code that allows you to obtain the filtering, smoothing, and forecasting distributions of a normal dynamic linear model that assumes that the observational variance is unknown. We call that V and the system covariance matrix Wt is assumed either known or specified using a discount factor. The code allows you to consider both options. The code also allows you to work with time-varying Fts and Gts, and state parameters that incorporate different components such as polynomial trends, regression, periodic components via Fourier representations, and so on. I’m going to go first through the functions, and then we will illustrate these analyzing a dataset that is available in R. As before, I’m putting all the functions into a single file that you can source. The first function that we have is a function that setups the DLM matrices that you specify, assuming that the observational variances is unknown, so you can pass Ft and Gt for all the times that you are considering. Then you have the option also of passing Wt_star, which is just in the case where you assume that you are going to specify the structure of the covariance of the noise at the system level, then you would pass this Wt_star. If you don’t pass it, then the function assumes that you’re working with a discount factor. Here, the function just returns, puts all those matrices into a single object. Similarly, there is a function that puts all the prior parameters into a single list. This is again an object that you can then pass to the filtering, smoothing, and forecasting functions. Here, if you remember the conjugate structure conditional on the observational on variance V, the state vector, Theta0 is distributed as normal m0, vC0*. Then the prior distribution for the variance has a degrees of freedom parameter that we call in n0, and then you have these S0 that you can think of as an initial estimate for the variance. This is just a structure of the prior distribution that completes this normal inverse Gamma structure. Then you have the forward filtering function. Again, here you pass the data, the matrices that define the structure of the model, the initial states that you created with these functions. Then you have the option of passing a Delta or a grid of values. We will also talk about this, but this is just a value Delta for the discount factor. If you don’t specify the Delta, the function assumes that you are working with a case in which you know the structure of the Wt_ star. If you pass the Delta, then the function assumes that you are using a discount factor to specify that Wt matrix. If you look into the function, you can see that it has a similar structure to the functions we have discussed before now, the Fts are allowed to be time-varying. Then you have a different prior structure, and then you can obtain the moments for the prior distribution of Theta t given Dt minus 1. There are two options if you look into the Code, one for the case in which you don’t have a discount factor and one for the case in which you have a discount factor. Then you can continue with the moments of the one-step ahead forecast. Again, these are the equations and the moments of the posterior distribution for Theta t given Dt. The function returns an object that contains not only mt, Ct, at, Rt, ft, Qt and et, as we had before, it also has the degrees of freedom parameters and the Sts for each time t. All these is contained into this object that the function returns. 
Then there is a function to compute the smoothing distributions, the function receives as input, the data, the matrices that define the model structure, and this object that is the posterior states. You first have to run the filtering function, keep whatever the function returns into an object and then pass that object. This is the posterior states object. Then again, you have the option of providing a discount factor or not. If you look into the equations, you will see again, that you have the smoothing equations are updated in terms of the equations we discussed for the case in which there is no delta. Then in that case, we are assuming that the Wt structure is provided by the user, and in the case in which a discount factor is specified. Then we have also the list that returns. Essentially this function returns a list with the moments for the smoothing distribution of the state parameter vector, and also for the mean response. We also have then a function that allows you to compute the k steps ahead forecast distribution. The function receives the posterior states. k here is the number of steps ahead you want to compute. You have the matrices and then again, you may or may not have a delta, which is a discount factor. It computes the moments of that forecast distribution and it returns the moments for, again, at and Rt that correspond to the state vector, and then for the mean response or the k steps ahead, forecast function here. Going now to the example, I’m going to show you how to use these functions using a dataset that R provides. We will first source the functions that we just discussed and then we’re going to use these dataset that I’m plotting right here. This corresponds to 100 observations. They are annual measurements of the Nile river level in a particular location. The units are 10^8 square meters in a period of time that goes from 1871-1970. We’re going to use our first order polynomial DLM, which is a structure that we have used before. But now we are going to assume that the variance of the observational level is unknown, and constant over time. We’re going to also estimate this variance. The data contains 100 observations. We’re going to break these into two components. We are going to use 95 observations to fit the model, and then we’re going to predict the last five observations and see what we obtain. Here we’re just specifying the data, what I call the test data here it shows what you want to forecast. We set up the matrices for the first order polynomial. You notice here that I’m defining arrays that have the dimension of whatever is the model that you are using. In this case, a first order polynomial model has only one parameter. We have to specify Ft for all the values of t. Here, n is the total number of observations. I have to specify that for all the time points that I’m going to use to fit the model, and then also the ones that I’m going to predict the five last observations. That’s why the dimension is n, the Gt in the first order polynomial is constant equal to one. We specify that. Then we have this Wt star that I’m going to assume is just also one for all the values of t. After that we can specify the prior parameters for theta0 given V and D0. Here I’m assuming a priori, the mean level is 800 and the C0star is 10. The degrees of freedom parameter, I’m setting it to be one, and then the initial estimate for the observational variance is 10. 
Once I define this structure, I use those functions to set up all these matrices and initial parameters in terms of objects to pass through the filtering distribution, this smoothing distributions and so on. Here the first thing, we compute the moments of the forward filtering distributions, assuming that V is unknown, so we pass the data, the matrices, and the initial states. We can also compute 95 percent credible intervals for the state vectors. In this case is a one-dimensional vector so we just pass the results that we obtain using the filtering equations. These are the moments, the expected value of the distribution, and the variance of that distribution as well as the degrees of freedom parameter. Remember that when we integrate out V, assuming this distributional structure that we have for V, we obtain that the distribution of Thetat given Dt is a student t-distribution, so you have to pass the degrees of freedom parameter here. We then can proceed with the smoothing. The smoothing function is going to use the data, the matrices, and the object that contains all the filtering results in here. Again, we can compute credible intervals for, now it would be the distribution of the state parameter vector at each time t, considering that we have received capital T observations. We then proceed with the one step ahead or k steps ahead forecast. In this case, k is going to be equal to 5. Then again, I can compute credible intervals for the distribution of yt plus k given Dt, in this case the ts we’re going to go all the way to capital T. We then plot the results. Here we first plot the dots here correspond to the actual observations and then the first thing that I’m going to plot is the filtering distribution, meaning the mean of the filtering distribution as well as the 95 percent credible interval. This corresponds to Theta t given Dt as I receive observations here. Then you can plot also the smoothing distribution so once you compute the filtering, you are going to go and revisit these distributions. You can see that the mean of this distribution is smoother than the one we computed with the filtering equations. Also you can see the uncertainty here also is smaller, so these are the smooth distributions. Because this is a first-order polynomial, if we think about the forecast function is going to be constant after that last data point, capital T so it’s going to have that structure over time. Then I can also compute credible intervals associated to that distribution. You can see that the pattern that as the number of steps ahead increases, the uncertainty increases. This results in the the posterior distributions again for filtering distributions for Theta t given Dt and then the smoothing distributions and the forecast distributions.

100.5 code: NDLM, Unknown Observational Variance Example 🗒️ \mathcal{R}

This code allows time-varying F_t, G_t and W_t matrices and assumes an unknown but constant \nu. It also allows the user to specify W_t using a discount factor \delta \in (0,1] or assume W_t known.

100.5.1 Nile River Level Filtering Smoothing and Forecasting Example

source('all_dlm_functions_unknown_v.R')
source('discountfactor_selection_functions.R')

## Example: Nile River Level (in 10^8 m^3), 1871-1970 
## Model: First order polynomial DLM
plot(Nile) 

n=length(Nile) #n=100 observations 
k=5
T=n-k
data_T=Nile[1:T]
test_data=Nile[(T+1):n]
data=list(yt = data_T)


## set up matrices for first order polynomial model 
Ft=array(1, dim = c(1, 1, n))
Gt=array(1, dim = c(1, 1, n))
Wt_star=array(1, dim = c(1, 1, n))
m0=as.matrix(800)
C0_star=as.matrix(10)
n0=1
S0=10

## wrap up all matrices and initial values
matrices = set_up_dlm_matrices_unknown_v(Ft, Gt, Wt_star)
initial_states = set_up_initial_states_unknown_v(m0, 
                                      C0_star, n0, S0)

## filtering 
results_filtered = forward_filter_unknown_v(data, matrices, 
                                            initial_states)
# Forward filtering is completed!
ci_filtered=get_credible_interval_unknown_v(results_filtered$mt, 
                                    results_filtered$Ct, 
                                     results_filtered$nt)

## smoothing
results_smoothed=backward_smoothing_unknown_v(data, matrices, 
                                             results_filtered)
# Backward smoothing is completed!
ci_smoothed=get_credible_interval_unknown_v(results_smoothed$mnt, 
                                         results_smoothed$Cnt, 
                                         results_filtered$nt[T])

## one-step ahead forecasting
results_forecast=forecast_function_unknown_v(results_filtered, 
                                                k,  matrices)
# Forecasting is completed!
ci_forecast=get_credible_interval_unknown_v(results_forecast$ft, 
                                          results_forecast$Qt, 
                                     results_filtered$nt[T])
## plot results
index=seq(1871, 1970, length.out = length(Nile))
index_filt=index[1:T]
index_forecast=index[(T+1):(T+k)]

plot(index, Nile, main = "Nile River Level",type='l',
     xlab="time",ylab="annual flow (1e8 m^3)",lty=3,ylim=c(400,1500))
points(index,Nile,pch=20)

lines(index_filt,results_filtered$mt, type='l', col='red',lwd=2)
lines(index_filt,ci_filtered[, 1], type='l', col='red', lty=2)
lines(index_filt,ci_filtered[, 2], type='l', col='red', lty=2)
lines(index_filt,results_smoothed$mnt, type='l', col='blue',lwd=2)
lines(index_filt, ci_smoothed[, 1], type='l', col='blue', lty=2)
lines(index_filt, ci_smoothed[, 2], type='l', col='blue', lty=2)

lines(index_forecast, results_forecast$ft, type='l', 
      col='green',lwd=2)
lines(index_forecast, ci_forecast[, 1], type='l', 
      col='green', lty=2)
lines(index_forecast, ci_forecast[, 2], type='l', 
      col='green', lty=2)
legend("topright",
       legend = c("Observed", "Filtered mean", "Filtered CI", 
                  "Smoothed mean", "Smoothed CI", 
                  "Forecast mean", "Forecast CI"),
       col = c("black", "red", "red", "blue", "blue", "green", "green"),
       lty = c(NA, 1, 2, 1, 2, 1, 2),
       pch = c(20, NA, NA, NA, NA, NA, NA),
       lwd = c(NA, 2, 1, 2, 1, 2, 1),
       bty = "n",
       ncol = 2)
Figure 100.1: Nile River Level Filtering, Smoothing and Forecasting Results
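The listing above holds out the last five observations in test_data but never compares them with the forecasts; a short out-of-sample check (my own addition, assuming results_forecast$ft holds the forecast means plotted in green) is:

## compare the 5-step-ahead forecast means with the held-out observations
forecast_errors <- test_data - results_forecast$ft
round(forecast_errors, 1)
sqrt(mean(forecast_errors^2))   # RMSE over the five held-out years
mean(abs(forecast_errors))      # mean absolute deviation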
Note: Overthinking the Nile River Data

We use the Nile dataset throughout the course even though it is notorious for long-term dependence and nonlinearity, which present real challenges for the NDLM framework; cf. the work of two giants, Harold Edwin Hurst and Benoit Mandelbrot, on this series. That line of work led to the Hurst exponent and to fractional Brownian motion (fractional Gaussian noise) and fractional ARIMA (ARFIMA) models, which are better suited to the Nile data but are not covered in this course.

Just a couple of concepts here:

1. In fractional Brownian motion (fBm), the Gaussian increments need not be independent. The dependence is governed by the Hurst exponent H, a measure of the long-term memory of a time series, and the covariance of fBm is \mathbb{E}[B_{H}(t)B_{H}(s)]={\tfrac {1}{2}}(|t|^{2H}+|s|^{2H}-|t-s|^{2H}), \tag{100.6}

  • where:
    • B_{H}(t) is the fractional Brownian motion, and
    • H is the Hurst exponent.
  2. In ARFIMA, the differencing order d is allowed to be a real number rather than an integer, i.e. we allow for fractional differencing via a formal binomial series expansion of the operator (1-B)^d:

\begin{aligned} (1-B)^{d} &= \sum_{k=0}^{\infty} \binom{d}{k} (-B)^{k} \\ &= \sum_{k=0}^{\infty} \frac{\prod_{a=0}^{k-1}(d-a)}{k!} (-B)^{k} \\ &= 1 - dB + \frac{d(d-1)}{2!}B^{2} - \cdots \end{aligned} \tag{100.7}
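A tiny illustration of Equation 100.7 (my own code): the fractional-differencing weights, i.e. the coefficients of B^k in the expansion of (1-B)^d, computed from the product formula. Their slow, hyperbolic decay is what gives ARFIMA its long-memory behaviour.

## fractional-differencing weights: coefficients of B^k in (1 - B)^d
frac_diff_weights <- function(d, n_lags = 10) {
  w <- numeric(n_lags + 1)
  w[1] <- 1                                        # k = 0 term
  for (k in 1:n_lags) {
    # binom(d, k) (-1)^k = (-1)^k prod_{a=0}^{k-1} (d - a) / k!
    w[k + 1] <- (-1)^k * prod(d - 0:(k - 1)) / factorial(k)
  }
  w
}
round(frac_diff_weights(d = 0.3), 4)   # 1, -0.3, -0.105, ... decaying slowly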

100.6 Case Study EEG data 🎥

### EEG example ##################
source('all_dlm_functions_unknown_v.R')
source('discountfactor_selection_functions.R')

eeg_data <- scan("data/eeg.txt")
length(eeg_data)
# [1] 400
eeg_data_gh <- scan("data/eeg_gh.txt")
length(eeg_data_gh)
# [1] 400
plot(eeg_data, type='l', main="EEG Data", xlab="Time", ylab="EEG Signal")

head(eeg_data)
# [1] -42.565610 -56.345140  -7.788712 -41.253280  33.549870 147.066900
head(eeg_data_gh)
# [1] -69.553802 -83.333328 -34.776901 -68.241470   6.561679 120.078735
T <- length(eeg_data)
data <- list(yt = eeg_data[])
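The listing above stops after loading the data; the walkthrough below fits a time-varying AR(12) (TVAR) model with a discount factor. The sketch that follows is my own reconstruction of that set-up: the array layout and function calls mirror the Nile example, Wt_star is omitted so that a discount factor is used, and I assume (as described in the walkthrough) that forward_filter_unknown_v accepts the discount factor as an optional fourth argument; delta = 0.994 is the value the walkthrough reports as optimal under the MSE criterion.

## TVAR(12) set-up sketch for the EEG series (my own reconstruction)
p     <- 12                                  # model order
yt    <- eeg_data[(p + 1):T]                 # responses y_{p+1}, ..., y_T
n_fit <- length(yt)

Ft <- array(0, dim = c(p, 1, n_fit))         # F_t holds the p lagged values
for (t in 1:n_fit) {
  Ft[, 1, t] <- eeg_data[(t + p - 1):t]      # y_{t-1}, ..., y_{t-p}
}
Gt <- array(diag(p), dim = c(p, p, n_fit))   # G_t = I: AR coefficients evolve as random walks

m0      <- matrix(0, p, 1)                   # prior mean for the AR coefficients
C0_star <- 10 * diag(p)                      # prior scale (times v)
n0      <- 1                                 # prior degrees of freedom
S0      <- 100                               # prior estimate of v

matrices_tvar <- set_up_dlm_matrices_unknown_v(Ft, Gt)   # no Wt_star: discount factor used
initial_tvar  <- set_up_initial_states_unknown_v(m0, C0_star, n0, S0)

## delta = 0.994: the discount factor selected in the walkthrough
results_tvar <- forward_filter_unknown_v(list(yt = yt), matrices_tvar,
                                         initial_tvar, 0.994)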

We know discuss our first case study type of example. In this case, we’re going to use a time-varying autoregressive model to analyze some EEG data. EEG stands for electroencephalogram data. This is a brain signal of a particular channel that was recorded on a subject that received electroconvulsive therapy. It’s just an electroencephalogram recorded at a particular location of the patient’s scalp. There is a lot of information that you can get about these data. This dataset was recorded at Duke University. There are multiple papers where we analyze these data. You can also look at a full analysis and discussions and references to the papers in the book, Prado Ferreira and West 2021. The idea here is also to provide a little bit of comparison and relationship between what we are learning here with dynamic linear models and what we learned earlier in the course with autoregressive processes. Let me just show you the data. Then before we do that, we will need here the file that contains all the functions to compute the filtering, smoothing, and forecasting distributions using an unknown v at the observational level. Then I’m putting all the functions to compute the optimal discount factor also in this other file that you will have to source this. If you open this file, you will see that it contains the functions we described before to obtain the optimal discount factors. Let’s just source those in. Here I’m providing the EEG data. It corresponds to the middle portion of an EEG that was recorded at a sampling rate of 256 hertz or 256 observations per second. The data was subsampled every fifth observation to make it a little smaller. Here I’m just plotting that dataset. What you look at here is the voltage level for the EEG data. We have these 3,600 observations. We’re going to use a model that is a time-varying autoregressive model, meaning you have an autoregressive structure in terms of a particular model order. Here we’re going to use a model order of 12. We’re going to regress yt on the 12th past values, of yt. What happens here is the AR coefficients are going to be time-varying. That’s why we are using a dynamic linear model for this particular example. The total number of observations is 3,600, I’m just going to do the analysis with the first 3,200 only and then we’ll see what the results are. I’m just going to use this first portion here of this EEG. When you look at the picture of the data, again, if you were to look at the characteristics in terms of the frequency and the amplitude, you would see that the data has a higher frequency or lower quasi-periodicity at the beginning and a higher quasi-periodicity towards the end. Then also the amplitude changes over time. Here we are just defining the data and now we are setting up the matrices. If you remember how we set up the matrices here, the Gt is an identity, and then the Ft is going to contain the past values of the yts. For each time t we are going to have 12 past values. We build that and get the Ft matrices over time and then the Gt matrix. The Gt is constant over time, the Ft changes over time. We set up the initial state matrices. The model order is 12, so we are going to have a distribution for the state parameter vector of dimension 12 that is a normal with m0 v times C0_star. Here we have just that matrix. Here this is a vector of zeros. I’m just setting everything at zero. Then the C_star matrix is a diagonal matrix of dimension 12 times 10. Each of the components has a variance of 10 here. 
This is conditional, again, this would be multiplied by v. The full covariance structure will be given by v times C0_star is a conditional structure. Then the parameters for the prior distribution, the number of degrees of freedom I’m assuming here is 1. Then the estimate, a priori for the observational variance is 100. We set up those. We also keep those in objects that then we can pass to the functions that compute the filtering, smoothing, and forecasting distributions. Here I’m going to specify the Wt using a discount factor so we are going to define a set of values here. I’m going to have a lower bound that is 0.95 and an upper bound that is one and then I’m going to have increments of 0.001 to define that grid of values. Just to compute the optimal discount factor. Then I’m going to use the mean square error as the measure to pick that optimal discount factor. We run this. It takes a little bit just because we have a fairly large range of grid of values. Once we compute the optimal discount factor, the function, if you remember, is going return the filtering moments or the moments for the filtering distribution, the moments for the smoothing distributions, as well as the forecasting distributions with the optimal value. If you change, you may choose a different measurement. There are, if you recall, four different measurements to compute the optimal value of the discount factor. Once this is done, I will show you the results of the analysis. Let me just show you here. The optimal discount factor that was selected here is 0.994. It’s a fairly large discount factor, meaning that the variations in the state parameters are occurring smoothly over time, but there are changes. We can now retrieve the results in terms of the filtering. distributions. We can compute the credible intervals for that. We can compute the results for the smoothing distributions and obtain credible intervals for that as well. I’m just going to plot here the results of these. This would be my data and these would be the blue that I superimpose on that, is just the mean of the smoothing distribution based on the model and this discount factor that we chose. It’s capturing the main characteristics of the data. But in this particular case, this information is not necessarily what we want to look at. We want to try to interpret what results this is giving us in terms of the time-frequency characteristics of the data. One way of obtaining that information is to think back in terms of the characteristic polynomial and the reciprocal roots of that characteristic polynomial. We are going to have a characteristic polynomial for each time t now, because we have AR coefficients that change over time. For each of those roots and for each polynomial we know that we can also obtain a spectral representation that changes over time. If you think about this spectral density that is associated to each of those autoregressive processes at time t, you can just represent that in terms of different spectral density. Let’s look at what the results are. The next piece of code is just computing those reciprocal roots of the characteristic polynomial at each time t. Then looking at what is the modulus of the reciprocal roots and what is the period of the reciprocal roots. Here we have a model order of 12. As you recall, you can have complex valued or real valued reciprocal roots, as we did in the case of the the standard auto-regressive model where the parameters are not changing over time. 
Because you can have a different number of roots in terms of how many pairs of complex roots you have and how many real roots you have, this model is time-varying. You could potentially have also changing number of complex valued and real valued reciprocal roots. What I’m going to plot here, is I’m going to look at the root that has the largest modulus. So the reciprocal rule that has the largest modulus over time. Then you will see here that root happens to be complex so you do get a pair of complex valued roots. There is a period that is associated to that dominant root in terms of the modulus. I’m going to plot the modulus of that root and also the periodicity. What you see here, once you run the code. This is that trajectory based on the mean of this smooth distribution for the state parameter vector. The AR coefficients at each time t here. You can see it starts very high, is very persistent at the beginning of the time period and then it oscillates essentially between 0.9. One, and you can see that it decreases towards the end of the seizure here or towards the end of the EEG recording. But this is a fairly persistent group. When you look at the corresponding frequency, and here I’m plotting that frequency in a scale that is hertz, so if you remember the original data was recorded at 256 hertz, but it was sub-sampled every fifth observation. So if I want to plot the results in terms of hertz here for that corresponding frequency, I have to do this transformation. What you see here, is that this dominant quasiperiodic component has a frequency that starts a little bit higher than five hertz, and then it decreases towards the end of the EEG recording which is something we saw also when we were plotting the data. If you were to look at the data and zoom in into different portions at the beginning of the EEG and at the end you would see that this is consistent with that behavior. In particular, neuroscientists like to talk about different frequency bands in the brain, so we see here that this dominant component is mostly in the delta frequency band for the period of recording, at the beginning is in the Theta frequency band. At the beginning, when the seizure is starting, it begins in the Theta band and it decreases to the delta band. Just to again, remind you of how this works, if I think about the spectral representation of this process over time, you can specify some time points here, and then look at what is the estimate of the spectral density that corresponds to that autoregressive process for that particular time. I’m just going to use the function, arma.spec from the library astsa, we have used this function before, and I’m just going to plot. Here we’re just plotting the spectral density estimated using again the results from the smooth distribution, so this is based on that mean of the smooth distribution at each time t, and we’re going to use the estimate of the observational variance that we get after fitting the entire data set. What we get from the dynamic linear model. Here the results, so this corresponds to just one spectral density, you see that if you were to translate this, so now the frequency appears plotted between 0.5, if you were to do the transformation to the original scale, you would see that the peak of these spectral density is precisely at that value that was around those that went from essentially five hertz down towards the end of the seizure. 
We can see, I’m now going to plot the different spectral densities here, so just going back just to show you the plot a little bit larger here, so I’m picking again different times during the EEG recordings, so this corresponds to the 1.96 seconds, 9.77 seconds and so on, so this is towards the beginning of the recording, and then this is towards the end of the recording. You can see that at the beginning of the recording, I have a power that is dominated mostly by a single quasiperiodic component, that is that one that we just plotted at the beginning, that begins in the Theta band. Then if you look at the location of the peak, you see that the location of this dominant peak is decreasing, so the frequency is decreasing towards the end of the recording. Also, we see that the power of these spectral densities tends to decrease also as the recording gets to the end. There are other components that are also contributing in terms of the spectral characteristics. This is the nice feature of using a time-varying autoregressive model. You can capture what’s happening with the data and what is the spectral representation, and what is the interpretation that we can give to these analyses in terms of something that neuroscientists can understand.
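The walkthrough above computes, at each time t, the reciprocal roots of the TVAR(12) characteristic polynomial from the smoothed AR coefficients and converts the argument of the dominant root into hertz using the effective sampling rate of 256/5 Hz. A minimal sketch of that computation for a single time point (my own code, using base R's polyroot) is:

## dominant reciprocal root of the characteristic polynomial (my own sketch)
## phi: vector of AR coefficients at one time point, e.g. taken from the smoothed means
tvar_dominant_root <- function(phi, sampling_rate = 256 / 5) {
  roots <- polyroot(c(1, -phi))    # characteristic polynomial 1 - phi_1 z - ... - phi_p z^p
  recip <- 1 / roots               # reciprocal roots
  mods  <- Mod(recip)
  args  <- abs(Arg(recip))
  i <- which.max(mods)             # root with the largest modulus
  freq_hz <- if (args[i] > 1e-8) args[i] / (2 * pi) * sampling_rate else 0
  c(modulus = mods[i], frequency_hz = freq_hz)
}

## example: an AR(2)-like quasi-periodic coefficient pair
tvar_dominant_root(c(1.8, -0.95))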