97  Normal Dynamic Linear Models, F.A.Q

Time Series Analysis

Normal Dynamic Linear Models (NDLMs) are a class of models used for time series analysis that allow for flexible modeling of temporal dependencies.
Bayesian Statistics
Time Series
Author

Oren Bochman

Published

November 6, 2024

Keywords

Time Series, Filtering, Kalman filtering, Smoothing, NDLM, Normal Dynamic Linear Models, Polynomial Trend Models, Regression Models, Superposition Principle, R code

97.1 NDLM FAQ

Everything you wanted to know about NDLMs but were afraid to ask meets Everything I would tell my younger self if ….

I found the DLM somewhat challenging. As I progressed through the material, I had some questions. And later, reading the textbooks, I found some answers but had even more questions. By the time my notes were almost complete, I was able to come up with answers to all but the most challenging. I’m not sure I got everything right, but perhaps this will be a helpful resource to other students. If you find any errors, please let me know.

As I worked through exercises from the textbooks (cf. Section 97.1.2), many questions came up which I believe will be useful to people unfamiliar with DLMs. Despite having learned about DLMs and done some work on time series, it is still a challenge not just to understand this topic but even to ask thoughtful questions about it. Some of these questions are not very smart; they mostly document misconceptions I had. Also, some questions were very long, and I prefer shorter titles for the FAQ, but I kept the material more or less the same.


97.1.1 Why is this course so complicated?

Some reasons why this is complicated and how to make it simpler.

One reason I find DLMs tricky is that, unlike AR(p) or even ARMA(p,q), we are not dealing with a model where we simply specify some parameters \theta, put priors on them, pour in the data, and get back a posterior for the parameters and a posterior predictive distribution for making predictions.

  • NDLMs are multifaceted like diamonds:
    • One facet is that they are like a multiple regression
    • Another facet is that they are like a state space model
    • Another facet is the use of Kalman filters in filtering and smoothing
    • Another facet is that like neural networks they have an architecture and hyperparameters
    • Another attractive feature is that, unlike neural models, they can be interpreted and can be fit with relatively little data.
    • Another facet is that the number of degrees of freedom and the number of parameters are not the same.
    • Architecture:
      • They can contain a polynomial trend that is smooth
      • They can contain AR(p) components which can be jagged
      • They can contain MA(q) components which requires an augmented-state construction.
      • They can encode seasonality using dummies (jumping patterns)
      • They can encode seasonality using sinusoids (complex smooth patterns)
      • They can also incorporate time indexed regressors
      • They can incorporate geo-spatial data, but this is not covered in this course or by the R package; it is more of an extension of the DLM framework to DSTMs

We can literally add the components into one big DLM. However here things get more interesting.

Each component DLM has both an observational noise and a system noise. In superposition there is a single observation variance/covariance V_t for the combined model, while the component-specific evolution variances live in blocks of W_t. So when we add components, we may need to customize the variances, as sketched below.
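
As a minimal sketch of how this looks in code, assuming the dlm package in R: the components, the regressor, and all variance values below are illustrative rather than estimates from any data.

```r
library(dlm)

set.seed(1)
x <- rnorm(100)   # an illustrative regressor series

# One observation variance for the whole model (dV = 1.5, placed on the trend
# component; superposition sums the components' V). Each component contributes
# its own block to W.
trend  <- dlmModPoly(order = 2, dV = 1.5, dW = c(0.1, 0.01))    # locally linear trend
season <- dlmModSeas(frequency = 12, dV = 0,
                     dW = c(0.5, rep(0, 10)))                   # seasonal dummies
reg    <- dlmModReg(X = x, addInt = FALSE, dV = 0, dW = 0.05)   # dynamic regression

mod <- trend + season + reg   # '+' on dlm objects implements superposition
V(mod)                        # single observation variance of the combined model
W(mod)                        # block-diagonal system (evolution) covariance
```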

A second bit that’s tricky is the model’s dimensions and how many of those are free variables. Filtering gives us estimates of the state parameters we collect into \theta_t, but we will also need to infer the two variances.

Also in the (Prado, Ferreira, and West 2023) book the authors point out that we will also want to infer some parameters of F and G. Giving it some thought I think they are:

  • the degree for the polynomial trend components
  • the number of harmonics for the seasonal component
  • the p and q parameters for the ARMA components
  • which regressors and interaction terms to include in the regression component

In other words, we may well be interested in doing additional inference for model selection.


97.1.2 What Books are there on NDLMs?

The first two books are hard to read, but they also contain a slew of exercises. Like most mathematical textbooks, you won’t get more than 25% of the material unless you do enough of these exercises. I doubt you’ll even memorize the recursion equations and their interpretations unless you do enough of them. P.S. unless you are in the middle of your PhD, doing the exercises is very, very taxing. There are no solutions, and even reading the derivation in my own solution was taxing. With enough details I could be fairly sure I had a decent answer, but it could take a couple of hours or a couple of days to get there. I feel that to a large extent I would learn more building models than doing derivations. It is clear that handling missing data, structural change points, or interventions requires great facility with Kalman filtering, as well as setting up the data and model so that the DLM package can do its magic. This is well beyond the level of the course.

That said, lots of the material seems rather theoretical, and I want to use NDLMs in some sophisticated models. In hindsight, I believe this stuff isn’t as complicated as it first seems to the student. Other books on the subject come from Econometrics, Ecology, and other applied areas and are likely far more accessible. There is also more software in the wild that can facilitate things like finding structural changepoints.

  • (West and Harrison 2013) lays down the theory but is long-winded. It often meanders rather than giving a clear and concise explanation. Perhaps the authors assume the readers will read it several times during their PhD and a few more during their postdoc. There are many gaps in this book.

  • (Prado, Ferreira, and West 2023) is much longer, more recent, far more advanced, covers many more models, tends to reference many research papers, is poorly motivated, and suffers from an even greater tendency to meander rather than provide the deep insights its maven authors clearly possess. The book has many appendices that feel more like handouts on unrelated topics, with one or two results that we might have used.

  • (Petris, Petrone, and Campagnoli 2009) is the most accessible of the trio. It is part of the excellent Use R! series, which I skimmed through years and years ago. The book takes a hands-on approach to using the dlm library in R, and it fills in some of the gaps in the two texts above. However, it isn’t so easy to pick up NDLMs without a background in Bayesian statistics and time series.

The following titles are books I have not read but which I noticed during my research.

  • (Durbin and Koopman 2012) is a classic. It is a very mathematical book that is hard to read, but it does have a lot of exercises. It is a great choice for those who want to understand the mathematical foundations of the Kalman filter and its applications in time series analysis.

  • (Harvey 1990) is an excellent book covering the Kalman filter and its applications in time series analysis. It is more accessible than Durbin and Koopman, though still quite mathematical.


97.1.3 What is a Normal Dynamic Linear Model (NDLM)?

A Normal Dynamic Linear Model (NDLM), often simply called a Dynamic Linear Model (DLM) when normality is understood, is a class of dynamic models commonly assumed to have normal (Gaussian) distributions. It is characterized for each time t by a set of quadruples \{\mathbf{F}_t, \mathbf{G}_t, V_t, \mathbf{W}_t\}.

The core of an NDLM is defined by two equations:

  • Observation Equation: Y_t = \mathbf{F}_t^\top \theta_t + \nu_t, where Y_t is the observation, \theta_t is the parameter (or state) vector, \mathbf{F}_t is a known design matrix, and \nu_t is an observational noise term assumed to be normally distributed with zero mean and known variance V_t (\nu_t \sim \mathcal{N}[0, V_t]).
  • System Equation: \theta_t = \mathbf{G}_t \theta_{t-1} + \omega_t, where \mathbf{G}_t is a known system evolution matrix and \omega_t is a system noise term assumed to be normally distributed with zero mean and known variance matrix \mathbf{W}_t (\omega_t \sim \mathcal{N}[0, \mathbf{W}_t]).

The error sequences (\nu_t and \omega_t) are assumed to be internally independent, mutually independent, and independent of the initial information. NDLMs are widely used for modeling time series, capturing how processes change over time. Their flexibility and generality allow them to handle complex problems and directly quantify uncertainty, enabling the fitting of models with many parameters and intricate probability specifications.


97.1.4 What are these moments we keep hearing about?

Let’s take a moment to unpack the moments reference.

When we talk about filtering, we have two or three equations called the filtering equations. These have a recursive form and are often either conditionally or directly Normal or Student-t. The first and second moments of these distributions are fed into the next update equations. They have names and interpretations, which I might cover in another question. So that is the main usage of moments, but we also have priors and other distributions, and they also have moments, and the professor might be talking about those as well. But the priors are the starting point of the recursive formulas, so what I explained initially is a good place to build intuition.

Tip: Moments cheat-sheet

\begin{aligned} \color{RoyalBlue}{a_t} &= G_t m_{t-1} & \text{state prior mean}\\ \color{RoyalBlue}{R_t} &= G_t C_{t-1} G_t' + W_t & \text{state prior var}\\[2pt] \color{Magenta}{f_t} &= F_t' a_t & \text{1-step forecast mean}\\ \color{Magenta}{Q_t} &= F_t' R_t F_t + V_t & \text{1-step forecast var}\\[2pt] \color{BrickRed}{A_t} &= R_t F_t / Q_t & \text{Kalman gain}\\ \color{ForestGreen}{m_t} &= a_t + A_t (y_t - f_t) & \text{state post mean}\\ \color{ForestGreen}{C_t} &= R_t - A_t A_t' Q_t & \text{state post var} \end{aligned} \tag{97.1}
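
As a sanity check on these recursions, here is a minimal hand-rolled univariate filter in R (independent of the dlm package); the function name kf_filter, the default F, G, V, W, m_0, C_0, and the simulated series are all illustrative assumptions, not anything from the course code.

```r
# Minimal univariate Kalman filter implementing the recursions above.
# F and G are scalars here (e.g. a first-order polynomial trend: F = 1, G = 1);
# V, W, m0, C0 are assumed known -- all values are illustrative.
kf_filter <- function(y, F = 1, G = 1, V = 1, W = 0.1, m0 = 0, C0 = 10) {
  n <- length(y)
  m <- numeric(n); C <- numeric(n); f <- numeric(n); Q <- numeric(n)
  m_prev <- m0; C_prev <- C0
  for (t in seq_len(n)) {
    a    <- G * m_prev                 # state prior mean
    R    <- G * C_prev * G + W         # state prior variance
    f[t] <- F * a                      # one-step forecast mean
    Q[t] <- F * R * F + V              # one-step forecast variance
    A    <- R * F / Q[t]               # Kalman gain
    m[t] <- a + A * (y[t] - f[t])      # state posterior mean
    C[t] <- R - A^2 * Q[t]             # state posterior variance
    m_prev <- m[t]; C_prev <- C[t]
  }
  list(m = m, C = C, f = f, Q = Q)
}

# Example: filter a noisy random walk
set.seed(1)
y   <- cumsum(rnorm(100, sd = 0.3)) + rnorm(100, sd = 1)
out <- kf_filter(y)
```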


97.1.5 Can you explain Inference in the NDLM

We covered four cases in the notes, yet it is easy to miss the big picture.

We saw several derivations of filtering, etc., with different settings for v_t and \mathbf{W}_t:

  1. v_t and \mathbf{W}_t both known (Normal conjugate structure)

  2. v_t = v unknown and \mathbf{W}_t known up to scale by v (variances and covariances scaled by v, an \mathcal{IG} prior for v, and Student-t forecast and state distributions)

  3. v_t = v known and \mathbf{W}_t unknown/changing, set via a discount factor \delta

  4. v_t = v unknown and \mathbf{W}_t unknown/changing, set via a discount factor \delta.

As far as I can tell from the start of (Prado, Ferreira, and West 2023, sec. 4.3), these are strong simplifying assumptions on the road to a more general case with v_t = v unknown and \mathbf{W}_t unknown. As best as I can tell, for our time series analysis, the input is the same for all cases….

Can we also do away with the assumption of having constant observational variance? Isn’t it a strong assumption for us to make?

97.1.6 Can we use Bayesian methods to infer the F and G of an NDLM?

It is too easy to get hung up on the word Bayesian here. I mean, it would be neat if there were an algorithm that figured out F and G from the data. It would be even nicer if there were an algorithm that could recover the dynamics from the data. (Phase space reconstruction algorithms are possible in chaotic systems, where trajectories are dense in the phase space, but if the trajectory isn’t chaotic we are out of luck.)

Anyhow, it is our job as modelers to set up the model and make various assumptions. For a Kalman filter we need the system dynamics. Using a DLM entails describing these via a superposition of a trend, periodicity, ARMA(p,q) and a time-indexed regression. This is our choice regarding the inductive bias for our model, and it is subjective. It reflects our view of the underlying process we are trying to model.

To sum up:

It is your job as a good Bayesian to make your assumptions explicit and to be aware of their implications. If you specify the model well, you may imagine the busts of two exponents of subjective probability, de Finetti and Ramsey, nodding at you in approval from their pedestals. And if you make poor choices, you might hear the bust of Rudolf E. Kálmán having a fit as his filter chokes on your data.

97.1.7 Isn’t the NDLM over/under specified?

If \{\mathbf{F}_t, \mathbf{G}_t, V_t, \mathbf{W}_t\} changes at every time point t, i.e. if some or all of the components of the model change, we are likely to have overfitting or underfitting problems (too many/few parameters compared to the data). The Kalman filter performs filtering and smoothing, and while these operations have optimality guarantees, they stabilize only if they are permitted to converge to some limit (i.e. enough steps, where enough might depend on V_t, \mathbf{W}_t).

The NDLM framework is very flexible, generalizing many known time series models in a single framework. In practice, changes in the model at some index t should be limited to:

  • handling NAs,
  • handling an intervention, or
  • handling a structural change point (these are covered in Section 97.1.23).
Tip: Missing data, interventions, and structural breaks (practical cookbook)

Missing y_t:
- Skip measurement update at t: keep (m_t,C_t)=(a_t,R_t); continue with t\!+\!1

Interventions (level shift, temporary shock, ramp):
- Add a regressor to F_t with known design (step, pulse, ramp).
- Give its state component a small \delta (fast adaptation) or a spike-and-slab prior.

Structural break:
- Temporarily reduce \delta (or inflate W_t) for the block governing level/slope.
- Optionally reinitialize (m_t,C_t) for that block.
- For recurring regimes, consider switching DLMs / Markov-switching LDS.

cf. (Harvey 1990), (Durbin and Koopman 2012, sec. 11.5), (West and Harrison 2013, Ch. 11)
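
Below is a minimal sketch of the NA and intervention recipes from the cookbook above, assuming the dlm package in R; the simulated series, the break date (t = 80), and all variance values are made up for illustration.

```r
library(dlm)

set.seed(2)
y <- cumsum(rnorm(120, sd = 0.2)) + rnorm(120, sd = 0.5)
y[40:44] <- NA   # missing observations: dlmFilter skips the measurement update here

step <- as.numeric(seq_along(y) >= 80)   # known step design for a level shift at t = 80

mod <- dlmModPoly(order = 1, dV = 0.25, dW = 0.01) +          # local level
       dlmModReg(X = step, addInt = FALSE, dV = 0, dW = 0)    # intervention regressor

filt <- dlmFilter(y, mod)
smo  <- dlmSmooth(filt)
```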


97.1.8 How much data do I need to fit an NDLM with k dimensions ?

The amount of data required to fit an NDLM with k dimensions depends on several factors, including the complexity of the model, the number of parameters to be estimated, and the desired level of precision in the estimates. In general, more data is needed for:

  1. Higher Dimensionality: As the number of dimensions (k) increases, the parameter space becomes larger, requiring more data to obtain reliable estimates.

  2. Model Complexity: More complex models with intricate structures (e.g., multiple state variables, non-linear relationships) typically require more data to accurately capture the underlying dynamics.

  3. Desired Precision: If high precision is needed in the parameter estimates or predictions, more data will be necessary to reduce uncertainty.

A common rule of thumb is to have at least 10–20 observations per parameter to be estimated. However, this is not a good rule for DLMs. Three full seasons might be more appropriate, and the actual data requirements may vary based on the specific context and goals of the analysis. One recommendation is tuning \delta by the one-step predictive log score and checking the standardized forecast errors (West and Harrison 2013, Ch. 6).


97.1.9 How are filtering, smoothing, and forecasting performed in NDLMs?

NDLMs provide a coherent framework for these key time series analyses:

  • Filtering: This process estimates the current state of the system (\theta_t) based on all available observations up to the current time t. In NDLMs, this is typically achieved using the Kalman filter recurrences. The posterior mean of the state is a weighted average of the prior mean and the current observation, with weights proportional to their precisions.
  • Smoothing (Retrospective Analysis): This involves estimating past states (\theta_s for s < t) by incorporating all available data, including future observations up to a fixed time T. It provides a retrospective view of the parameter values based on the entire dataset. Conditional independence results are crucial for developing efficient smoothing algorithms.
  • Forecasting: This involves predicting the future behavior of the system state and observations (Y_{t+k}, \theta_{t+k}) for k steps ahead, given data up to the current time t. Forecast functions define the qualitative form and expected numerical development of the time series.

These processes provide linear posterior means and variances, which is a justification for their use even outside strict normality assumptions.
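
A quick sketch of the three operations, assuming the dlm package in R; the model (a second-order polynomial trend with made-up variances) and the simulated series are purely illustrative.

```r
library(dlm)

set.seed(3)
y <- cumsum(rnorm(100, sd = 0.3)) + rnorm(100)             # simulated noisy random walk

mod  <- dlmModPoly(order = 2, dV = 1, dW = c(0.05, 0.01))  # linear growth model
filt <- dlmFilter(y, mod)                 # filtering:   p(theta_t | D_t)
smo  <- dlmSmooth(filt)                   # smoothing:   p(theta_t | D_T)
fc   <- dlmForecast(filt, nAhead = 10)    # forecasting: p(y_{t+k} | D_t), k = 1..10
fc$f                                      # forecast means
fc$Q                                      # forecast variances
```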


97.1.10 What are the computational challenges and methods used for NDLMs with unknown parameters?

When NDLM parameters, especially variances, are unknown and time-varying, obtaining exact analytical solutions for the full posterior distributions becomes complex. This necessitates the use of various computational techniques:

  • Approximation Techniques:
    • Normal Approximation: Often applied to posterior distributions, especially for parameters that are difficult to model directly.
    • Linearization: For models with non-linear components, linear approximations can be used to transform them into (approximate) DLMs, allowing for standard DLM analysis.
  • Simulation-Based Methods (Markov Chain Monte Carlo, MCMC):
    • General MCMC: These methods are widely used for posterior inference in complex dynamic models.
    • Gibbs Sampling: A common MCMC approach; it is often easily implemented for sampling posterior distributions of model parameters and state vectors within a fixed time interval, especially for conditionally linear/normal models.
    • Forward Filtering, Backward Sampling (FFBS): A specific and efficient MCMC algorithm introduced for sampling the full set of state vectors from the posterior distribution in conditionally Gaussian DLMs. It exploits the Markovian structure of the model.
  • Particle Filters (Sequential Monte Carlo, SMC): These are crucial for “on-line” or recursive inference, where new observations frequently arrive and re-running an entire MCMC every time is computationally inefficient. Particle filters are particularly well-suited for non-linear and non-Gaussian state-space models. Rao-Blackwellized particle filters can improve efficiency in conditionally Gaussian contexts.


97.1.11 How are NDLMs specified, designed?

  • Model Specification and Design: DLMs are often constructed by superposition (combining) two or more component DLMs, each capturing a specific feature like trend, seasonality, or regression. The starting point for model design is typically the desired forecast function, which determines the qualitative and quantitative form of the time series development.
  • Hierarchical Models: These are powerful extensions for problems involving multiple, related parameters. Data are modeled conditionally on parameters, which themselves are given a probabilistic specification in terms of hyperparameters. This structure allows “borrowing strength” across related groups or units, enhancing inference.

97.1.12 How are NDLMs checked for adequacy?

  • Model Checking (Diagnostics): Essential for assessing how well the model fits the data and substantive knowledge:
    • Posterior Predictive Checks: Involve simulating replicated datasets from the model’s posterior predictive distribution and comparing them to the observed data. For NDLMs, this often includes examining the standardized forecast errors e_t/\sqrt{Q_t}, which should resemble Gaussian white noise if the model is adequate. Graphical tools like QQ-plots and empirical autocorrelation functions are used to assess normality and uncorrelatedness.
    • Sensitivity Analysis: Involves recomputing posterior inferences under plausible alternative models to evaluate the robustness of conclusions to modeling assumptions.
    • Model Comparison: Competing models can be evaluated based on measures like predictive accuracy (e.g., log score), information criteria (e.g., AIC, DIC, WAIC), or Bayes factors.
Warning: Diagnostics checklist

Use these on the standardized forecast errors.

Let r_t = e_t/\sqrt{Q_t}.

  • White noise: ACF/PACF of r_t; Ljung–Box on r_t (not raw residuals).
  • Normality: QQ-plot of r_t; heavy tails ⇒ variance discounting or robust component.
  • Calibration: coverage of 1-step predictive intervals; PIT histogram.
  • Predictive score: rolling log score or CRPS for model comparison.
  • Breaks: spikes/variance jumps in r_t ⇒ local \delta\downarrow or add intervention regressor.
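
A short sketch of some of these checks in R, assuming filt is the dlmFiltered object from the filtering sketch above; in the dlm package, residuals() on such an object returns the standardized one-step forecast errors.

```r
# Standardized one-step forecast errors r_t = e_t / sqrt(Q_t)
r <- residuals(filt, type = "standardized", sd = FALSE)

Box.test(r, lag = 20, type = "Ljung-Box")   # white-noise check on r_t
acf(r)                                      # empirical autocorrelation of r_t
qqnorm(r); qqline(r)                        # normality / heavy-tail check
mean(abs(r) <= qnorm(0.975))                # rough coverage of 95% one-step intervals
```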

97.1.13 What is the difference between DLM and NDLM?

DLMs are the subject of (West and Harrison 2013), though most of the time the authors are actually discussing NDLMs, which are a special case of DLMs in which the priors and the noise terms are Normal. We use NDLM to emphasize this normality assumption, which simplifies the analysis by letting us replace Bayes’ rule in derivations with powerful results from Normal theory, and which allows for the use of conjugate priors, making the Bayesian updating process more straightforward. DLMs can be formulated with any member of the exponential family, but those models are not as well explored as the Normal ones.

97.1.14 Why are \mathbf{F}_t and \mathbf{G}_t a vector and a matrix respectively?

It may help to think about \mathbf{F} and \mathbf{G} as follows:

If we start with \mathbf{G}_t, we see it is a linear transformation that describes how the state vector evolves over time. I like to think about it as a Hidden Markov state transition matrix.

And once we have the updated state, \mathbf{F}_t^\top acts as a linear transformation that maps the latent state \vec{\theta}_t into the observation space of y, while \nu_t injects some observation noise.

In the state evolution equation \theta_t = \mathbf{G}_t\theta_{t-1}+\omega_t we pre-multiply \theta_{t-1} by \mathbf{G}_t to deterministically update the state, and we then add \omega_t to account for process noise.

In other words, \mathbf{F}_t takes the current hidden state \theta_t and produces an observation y_t, while \mathbf{G}_t takes the current state and produces the next state.


97.1.15 Why is a DLM called a linear model?

This is because the observation equation is a linear equation that relates the observations to the parameters in the model, and the system equation is a linear equation that tells us how the time-varying parameters change over time. This is why we call it a linear model.


97.1.16 Why are the noise terms \nu_t and \omega_t assumed to be normally distributed?

This is a common assumption in time series analysis. It is a convenient assumption that allows us to perform Bayesian inference and forecasting in a very simple way. And this is why we call this a normal dynamic linear model.

97.1.17 Isn’t this just a hierarchical model?

It is a hierarchical model, but not just that. First, the observation and system evolution equations are also auto-recursive, giving them a temporal structure. We have a model for the observations and a model for the system level. The system level is changing over time, and the observations are related to the system level through the observation equation. As explained above, G is a matrix, i.e. a set of simultaneous equations, and these may capture hierarchical, multilevel or other structures.

We saw in the development of the p order polynomial trend model that we can add p levels to the evolution equation. And so it is possible to extend this model to more complex structures if we wish to do so by adding another level, etc…

However, as we add more levels, they must be written in a representation that the KF algorithm can process.

This means we will take all these levels and fold them into G and keep the temporal structure of the two level overall framework!

One more thought on structure is that we can combine different DLMs into a bigger one using superposition (stacking the state vectors). This isn’t something we considered before for hierarchical models, so again: not just a hierarchical model.


97.1.18 What is the difference between NDLMs and AR(p)/ARIMA model?

NDLMs are built of components, and one of the components can be an AR(p) model. AR(p) models need to be stationary, but NDLMs have no such requirement. Note that what I said above regarding AR(p) also applies to an ARMA component, which is a more general model than AR(p).

  • Intuitively though, the NDLM can have seasonal and trend components, and if those account for the non-stationary part of the series then the AR(p) might account for the stationary residual (see the sketch below).

TODO: It is unclear that this does happen nor that there are guarantees that the algorithms will do this.
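
A minimal sketch of such a trend-plus-AR decomposition, assuming the dlm package in R; the AR coefficients and all variance values are illustrative, not estimated from data.

```r
library(dlm)

# Locally linear trend plus a stationary AR(2) component for the residual.
mod <- dlmModPoly(order = 2, dV = 1, dW = c(0.05, 0.01)) +
       dlmModARMA(ar = c(0.6, -0.2), sigma2 = 0.5)

GG(mod)   # block-diagonal evolution matrix: trend block and AR(2) companion block
```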


97.1.19 What are moments for NDLM?

The instructor and the book refer to parts of the NDLM as moments; what is that about? We just said in Q1 that NDLMs posit a Normal structure on the priors and errors. When they talk about the moments, they mean the means and variances of these distributions. These are the quantities of interest. The priors are specified by us and so are known; the errors generally are not.

To make things a bit clearer, the Kalman filter is optimal, in some sense, at estimating the state of the system at time t given the data up to time t. The state is what we call \theta_t. What the Kalman filter can’t eliminate is the impact of the variance at the system and observation levels; there is always an error. However, there are theoretical guarantees that the Kalman filter will provide the best linear unbiased estimate (BLUE) of the state, so long as sufficient data has been seen.

The moments are the inputs and outputs of the model. We propagate posterior means and variances; m_t is the posterior mean, not an MLE. But we will also get a posterior for the variance at the system and observation levels. This is the other moment of the model, and it is here that we actually need to make strategic decisions. We can set a complex prior and use many measurements to get a good posterior for the variance, or, more likely, we don’t really know much about the errors and we prefer to postulate something as simple as possible and then use the data to get a posterior for the variance. (Simple here means a model that only requires us to compute and interpret the noise at the observation level, i.e. the difference between the model’s forecast and the actual observation we see next.)


97.1.20 What is this thing called \delta can I ignore it?

In the NDLM where we don’t know the system variance, we can replace it, under a simplifying assumption, by decomposing R_t in the filtering equations. I think of this as providing us with a “surrogate” model, i.e. a simpler model that approximates our original model.

What we do is decompose the system model into a deterministic part and a stochastic part. We update the covariance using a reduced form: we have a raw estimate of the covariance based on how the previous time's uncertainty evolves according to G, i.e. G C_{t-1} G^\top, and we use \delta as a weight to set how much of that term we want to pass through.

So to sum up: we can use the discount factor hyper-parameter, denoted \delta, which we learned to estimate by minimizing a loss (e.g. MSE of the one-step forecasts) over \delta. This question is an informal outline of the section on specifying the system covariance matrix via discount factors.
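
A minimal univariate sketch of this discounting idea, mirroring the hand-rolled filter from the moments cheat-sheet above; F, G, V, \delta, and the starting moments are illustrative assumptions, and the log-score helper is just one way to compare values of \delta.

```r
# Discount-factor variant of the univariate Kalman filter: instead of a fixed W
# we set R_t = G C_{t-1} G' / delta, i.e. W_t = ((1 - delta)/delta) G C_{t-1} G'.
kf_filter_discount <- function(y, F = 1, G = 1, V = 1, delta = 0.95,
                               m0 = 0, C0 = 10) {
  n <- length(y)
  m <- numeric(n); C <- numeric(n); f <- numeric(n); Q <- numeric(n)
  m_prev <- m0; C_prev <- C0
  for (t in seq_len(n)) {
    a    <- G * m_prev
    R    <- (G * C_prev * G) / delta   # discounted prior variance
    f[t] <- F * a
    Q[t] <- F * R * F + V
    A    <- R * F / Q[t]
    m[t] <- a + A * (y[t] - f[t])
    C[t] <- R - A^2 * Q[t]
    m_prev <- m[t]; C_prev <- C[t]
  }
  list(m = m, C = C, f = f, Q = Q)
}

# Compare discount factors by one-step predictive log score (larger is better).
log_score <- function(y, delta) {
  out <- kf_filter_discount(y, delta = delta)
  sum(dnorm(y, mean = out$f, sd = sqrt(out$Q), log = TRUE))
}
```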


97.1.21 What is the Kalman gain

The Kalman gain is a key component of the Kalman filter, which is used in NDLMs to update the state estimates based on new observations. It determines how much weight to give to the new observation relative to the current state estimate.


97.1.22 How does the Kalman Filter feature in DLMs?

The Kalman filter is an iterative algorithm driving the NDLMs. It is used for estimating the hidden states of the model and updating these estimates as new observations become available. But Kalman Filters require four matrices to do their magic and the NDLM code handles putting everything into a form which is compatible with the Kalman filter.

Unfortunately, the Kalman filter has a tendency to amplify noise if V_t is underestimated, which can be problematic in practice. This means that if the model is not well specified, or if the noise characteristics change over time, the Kalman filter may produce unreliable estimates.


97.1.23 What is a structural change point?

This is a point in the time series where we need new parameters or even more parameters; the basic structure of the model is no longer adequate moving forward.

There is a lot of criticism of the Facebook Prophet algorithm breaking at some point. Digging deeper, this is due to distributional drift or, more likely, a structural change point. FB Prophet is

  • much simpler than NDLM and
  • uses Stan for MCMC and doesn’t use KF for its inference and
  • its regular users lack the ability to modify its internal components.

So it is a huge problem to fix FB Prophet if it blows up in production while being used in a recommendation system. Retraining FB Prophet may not work.

In contrast, DLM theory emphasizes that DLMs are open at any time point to changes at all levels (without worrying the student about how, in practice, they are supposed to do this or how robust the model is to such changes). After all, the Kalman filter is fantastic at using feedback to give optimal updates. But it is probably just as hard to handle a structural change, an intervention or an NA. I think that the extra code we got in class, which takes or creates lists of matrices indexed by time, is exactly an NDLM in short notation that handles this. It then becomes a matter of practice in inserting NAs, inserting interventions, and adding or changing components from a certain point.

While the course and the book seem to have provided the math to do this, the course lacks an explanation of how to use that part of the theory to handle these three modeling needs, nor does it have exercises on the subject.

I imagine if you can do this it would be a bit like Einstein teaching the first course on general relativity and then being surprised how Karl Schwarzschild got back to him within weeks with a solution for the case of a static, spherically symmetric black hole.

So when you want to use this in work or research it will always be harder! You will almost always need to adjust the data or the model for it to work for your use case. So you should try to do a few of each (NAs, different levels of interventions, and structural changes) so that when the day comes you can say: at least I have done this before, and I know how to do it and how to validate it!

Getting back on topic, there are state space models that incorporate identifying and switching between different models. These Markov-switching models combine a hidden Markov structure with DLMs. They can be used to model time series with structural change points, but they are more complex and require more data to estimate the parameters.


97.1.24 What does Polynomial mean in a Polynomial trend DLM ?

I was confused about this and there are three good reasons to be!

  1. The AR(p) model has a characteristic polynomial which has nothing to do with the polynomial in these trend models. The AR(p) components of DLMs create jagged forms in the forecast function, while polynomial trend models are popular because they add a smooth trend. The polynomial model is a sub-model that covers the trend; this allows us to model the residual as an AR(p) even if the data is non-stationary.
  2. In (West and Harrison 2013) the authors talk about a Taylor series approximation before delving into the first polynomial trend model. This is an intuitive way to think about the model: a polynomial approximation in which we can pick order one or two to get better approximations, lingo common in physics and numerical methods. However, it is completely unrelated to the polynomial that gives the model its name.

  3. For a polynomial trend model of order p, when we multiply out the terms of the forecast function we get a polynomial of degree p-1 in the forecast horizon (see the worked example below).
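
As a worked example, consider the standard second-order (linear growth) model with F = (1, 0)^\top and G = \begin{pmatrix}1 & 1\\ 0 & 1\end{pmatrix}. Then G^h = \begin{pmatrix}1 & h\\ 0 & 1\end{pmatrix}, so the forecast function is f_t(h) = F^\top G^h \mathbb{E}[\theta_t \mid \mathcal{D}_t] = m_{t,1} + h\, m_{t,2}: a polynomial of degree 1 = p-1 in the horizon h.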


97.1.25 Can NDLMs handle unknown and non-constant observational (V_t) and system (W_t) variances?

Yes, NDLMs can be extended to handle cases where both the observational variance (V_t) and the system variance (W_t) are unknown and vary over time. This moves beyond the simplest Kalman filter assumptions of known variances.

Here’s how this is typically managed:

  • Unknown but Constant Variances:
    • If the observational variance V is unknown but constant, it can be estimated using a conjugate Normal-Gamma prior for the state vector and observational precision (inverse variance). This results in the state and predictive distributions following Student-t distributions rather than normal distributions.
    • If the system variance \mathbf{W}_t is unknown but constant, it can be estimated using methods like the EM algorithm.
  • Time-Varying System Variance (W_t): The most common and practical approach is through discount factors (\delta). A discount factor defines \mathbf{W}_t as a proportion of the previous time step’s prior covariance, effectively R_t=\delta^{-1}G_tC_{t-1}G_t',\qquad W_t=\frac{1-\delta}{\delta}\,G_tC_{t-1}G_t'. This allows \mathbf{W}_t to be automatically time-varying and adaptive, simplifying the specification of complex covariance elements to a single scalar. Different discount factors can be applied to different components of the state vector.
  • Time-Varying Observational Variance (V_t): This is handled through variance discounting or discounted variance learning. This technique models a decay of information about the observational precision (\phi_t = 1/V_t) over time, maintaining the conjugate Gamma distribution form for the precision. The prior for \phi_t at time t is derived by discounting the degrees of freedom and scale parameter from the previous posterior (e.g. \mathcal{G}[\delta n_{t-1}/2, \delta d_{t-1}/2]). This makes the variance estimate more adaptive to recent data. The concept of “power-discounting” is also mentioned in relation to modifying the prior distribution, suggesting a general method for flattening distributions, which can be applied to precision parameters.
  • Multivariate Extensions: These discounting approaches extend to multivariate DLMs. For instance, matrix normal/Inverse Wishart distributions can be used to handle time-varying observational covariance matrices (\Sigma_t), often with dynamics defined by a matrix beta evolution model.

Tip: State discounting (\delta) for W_t

For scalar \delta\in(0,1] applied to the state evolution: \begin{aligned} R_t &= \delta^{-1}\,G_t C_{t-1} G_t' \\ W_t &= R_t - G_t C_{t-1} G_t' \;=\; \tfrac{1-\delta}{\delta}\,G_t C_{t-1} G_t'. \end{aligned}

Usage.

  • Smaller \delta ⇒ larger W_t ⇒ faster adaptation.
  • Use block discounts (different \delta) per component of \theta_t.
  • For breaks/interventions: temporarily set \delta\!\ll\!1 on affected blocks.

(West and Harrison 2013, Ch.6) (Prado, Ferreira, and West 2023, sec. 4) (Durbin and Koopman 2012, sec. 2)

Note: Variance discounting (\beta) for observation variance V_t

Discount the precision \phi_t=V_t^{-1} prior to keep \mathbb E[\phi_t] fixed while inflating uncertainty.

If \phi_{t-1}\mid\mathcal{D}_{t-1}\sim\mathrm{Gamma}(a_{t-1}, b_{t-1}) (shape–rate), set \phi_t\mid\mathcal{D}_{t-1}\sim\mathrm{Gamma}(\beta a_{t-1},\; \beta b_{t-1}),\qquad \beta\in(0,1]. Then \mathbb E[\phi_t]=a_{t-1}/b_{t-1} (unchanged) and \mathrm{Var}(\phi_t) = (1/\beta)\,a_{t-1}/b_{t-1}^2 (inflated).

Effect. Forecasts remain Student-t; recent data get more weight.
(Refs: West and Harrison (2013) §10.8; Prado, Ferreira, and West (2023) §4)


97.1.26 What are some common extensions and generalizations of NDLMs?

The DLM framework is highly flexible and can be extended in various ways:

  • Non-Normal and Non-Linear Dynamic Models:
    • Dynamic Generalized Linear Models (DGLMs): Extend DLMs by using exponential family distributions (e.g., Poisson for count data, Binomial for proportions) for the observational model, often involving non-linear link functions.
    • General Non-Linear Dynamic Models: Arise when parameters (e.g., \lambda in a transfer response function, or a discount factor itself) introduce non-linearities into the system or observation equations.
    • Stochastic Volatility (SV) Models: Often formulated as non-linear/non-Gaussian state-space models where volatility parameters evolve dynamically, requiring specialized computational methods.
    • Mixture Models: Can be incorporated to handle non-normal error distributions or to model phenomena like occasional outliers.
  • Multivariate and Matrix Normal DLMs:
    • Multivariate DLMs: Generalize to handle vector-valued observations, allowing for joint modeling of multiple time series.
    • Matrix Normal DLMs: Provide a framework for multivariate time series analysis where the covariance structure across series is unknown, leveraging matrix-variate normal distributions for fully conjugate analyses.
    • Dynamic Graphical Models: Combine matrix-variate DLMs with Gaussian graphical models, allowing for structured and often sparse precision matrices, which is useful for scalability in high-dimensional time series.
    • Dynamic Dependence Network Models (DDNMs): These models define multivariate dynamic models by coupling customized univariate DLMs, extending time-varying vector autoregressive (TV-VAR) models and allowing for flexible modeling of time-varying parameters and volatilities.
  • Spatio-Temporal Models: NDLMs form the foundation for dynamic spatio-temporal models (DSTMs), which model processes that vary across both space and time. These models can also incorporate non-linearity and non-Gaussian elements. Hidden Resolution Models (HRMs) are a type of multiscale time series model that can be formulated as DLMs.


97.1.27 The Normal Dynamic Linear Model: Definition, Model classes & The Superposition Principle

Dynamic Linear Models (DLMs) extend classical linear regression to time-indexed data, introducing dependencies between observations through latent evolving parameters. A Normal DLM (NDLM) assumes Gaussian noise at both observation and system levels, enabling tractable Bayesian inference through the Kalman filter.

While superficially complex, NDLMs are conceptually close to linear regression. Instead of I.I.D. observations indexed by i, we index data by time t and allow parameters to evolve with time, resulting in a two-level hierarchical model. At the top level is the observation equation. Below it are the evolution equation(s), which can be understood as a latent state transition model that can capture trends, periodicity, and regression. The evolution equations can have more than one level; however, we will see that with some work these are summarized into a matrix form.

To make things simpler, this is demonstrated using a white noise process and then a random walk model. What makes the NDLM somewhat different is that there are two variance elements at two levels, necessitating learning more parameters. Once we cover these two models, the instructor walks us through all the bits and pieces of the notation. Later we will see that we can add trend, periodicity, and regression components in a more or less systematic way. However, we need to pick and choose these components to get a suitable forecast function. This approach requires an intimate familiarity with the data generating process being modeled.

This approach is Bayesian in that we draw our parameters from a multivariate normal prior and use updating to improve this initial estimate by incorporating the data, ending up with a posterior, i.e. a distributional view of the time series that incorporates uncertainties. Additionally, there are a number of Bayesian quantities that can be derived from the model, such as

  • the filtering distribution that estimates the current state \mathbb{P}r(\theta_t \mid \mathcal{D}_t),
  • the forecasting distribution - to predict future observation: \mathbb{P}r(y_{t+h} \mid \mathcal{D}_t),
  • the smoothing distribution - retrospective estimate of past state: \mathbb{P}r(\theta_t \mid \mathcal{D}_{T})\quad t<T and
  • the forecast function, when F_t=F and \mathbf{G}_t=\mathbf{G}: f_t(h)=\mathbb{E}[y_{t+h} \mid \mathcal{D}_{t}] = F'G^h \mathbb{E}[\theta_{t} \mid \mathcal{D}_{t}], and
  • the usual credible intervals for forecasts and parameter estimates.

However, the DLM framework is quite flexible, and once you understand it, it can be adapted to support features like seasonality using the superposition principle. NDLMs also do not require the time series to be stationary.

As far as I can tell, NDLMs are just DLMs with their errors distributed normally at the different levels.