103.1 NDLM FAQ
Everything you wanted to know about NDLMs but were afraid to ask, meets everything I would tell my younger self from six months ago.
I found learning dynamic linear models challenging yet rewarding. As I progressed through the material, I had some questions. And later, reading the textbooks, I found some answers but had even more questions. By the time my notes were almost complete, I was able to come up with answers to all but the most challenging ones. I think I was able to ask the kind of questions that might make learning easier for new students.
I’m not sure everything in here is 100% correct, but more likely than not this should prove a helpful resource to other students. If you do find any errors, please let me know.
Now that I am working through exercises from the textbooks (cf. Section 103.1.2), many more questions have come up which I believe will be useful to people unfamiliar with DLMs. Despite having learned about DLMs and done some work on time series, it is still a challenge not just to understand this topic but even to ask thoughtful questions about it. Some of these questions are more about dispelling misconceptions I had.
Note: I am using DLM and NDLM interchangeably. The NDLM is a specific case of DLMs in which the state and observation equations are linear and the errors are normally distributed. The course textbooks state that it is possible to use other members of the exponential family, but the Normally distributed case is the default and the most commonly used.
103.1.1 What are some Pros and Cons of DLMs?
The main issue with DLMs is that they are hard to understand. But let's start with the pros.
So why use DLMs?
- They are a unifying framework that encompasses many simpler time series models. As you get more proficient with DLMs you should be able to write more and more time series models in this framework.
- They don't have the stationarity requirements of AR(p) or ARMA(p,q).
- They let you get results from much less data than neural network models
- They can give you uncertainties for the forecasts and parameters.
- They are faster to fit than typical MCMC methods, so you can make faster inferences and iterate more on models with different simplifying assumptions before dealing with the full complexity of your problem.
- They can be easily updated online with new data, making them suitable for online learning and real-time applications.
- They can handle missing data and interventions at different levels.
- They are interpretable - you can easily view the changing parameters over time.
So why are DLMs challenging?
- You don't get a nice ML-style API, and you don't get a nice report or even a bunch of diagnostic plots.
- DLMs generalize and make use of many other models. To use their full power you should:
  - have a good grasp of how these simpler models operate,
  - know how they are represented in the DLM framework, and
  - know how they are integrated into one big model.
  While the course covers the second and third items adequately, it only goes into depth on AR(p) models. For example, a time-varying regression component that fits multiple time series simultaneously is likely to be new. And while ARMA is discussed, we don't really cover it in depth or explain its DLM representation.
- ARIMA is another model that can be represented in the DLM framework, by using a polynomial trend for the integration and an ARMA component for the rest, but this is not discussed.
- Unlike AR(p) or even ARMA(p,q), we are not dealing with a model where we simply set up a Bayesian model with parameters \theta, place priors on them, pour in the data, and get a posterior for the parameters plus a posterior predictive distribution for making predictions.
- DLMs are hidden state-space models. While we discuss this, we don't delve into the state representation, the nature of the Markov property, or the long-run distribution. This state-space aspect is hard to reconcile with AR(p) and seasonal components, whose parameters act like a long-term memory or reference past values. So the state-space representation is subtle and its exposition may leave gaps.
- DLMs make use of the Kalman filter for their filtering and smoothing operations. This is a complex algorithm with many variants; it deserves its own specialization and so cannot be fully covered in this course. However, filtering and smoothing are essential operations for inference. They have many moving parts that are also not covered in a systematic way, and unfortunately the full complexity appears and reappears whenever we consider the different settings for inference. I think a few videos from a Kalman filter boot camp could be helpful.
- The cheat sheet below, outlining the names of the Bayesian constructs in the Kalman filter recursions and their uses, can dispel some of the confusion once it is memorized or kept handy.
- A number of results in the course are theoretical while others are practical, but the boundary is never clear. As far as I can tell, much of the inference material is not practical in nature but rather a sequence of derivations that lets us drop unrealistic assumptions we must initially make about the model.
- Similar to neural networks, working with DLMs requires making informed choices about architecture and hyperparameters. Once these are made, getting the DLM to work requires that the dimensions of the internal matrices line up correctly.
- Setting the system evolution is neither straightforward nor as well understood as it is made to appear. For example, for EEG data, how well do we understand the hidden dynamics of the brain's response to an external shock?
- The system equations allow us to introduce multiple hierarchical representations, but these must be folded into a matrix using a very specific state-space representation.
Another attractive feature is that, unlike neural models, DLMs can be interpreted and can be fit with relatively little data. Another facet is that the number of degrees of freedom and the number of parameters are not the same.
Architecture:
- They can contain a polynomial trend, which is smooth.
- They can contain AR(p) components, which can be jagged.
- They can contain MA(q) components, which require an augmented-state construction.
- They can encode seasonality using dummies (jumping patterns).
- They can encode seasonality using sinusoids (complex smooth patterns).
- They can also incorporate time-indexed regressors.
- They can incorporate geo-spatial data, but this is not covered in this course or the R package and is more of an extension of the DLM framework to DSTMs.
We can literally add the components into one big DLM. However, here things get more interesting.
Each component DLM has both observational noise and system noise. In superposition there is a single observation variance/covariance V_t for the combined model, while component-specific evolution variances live in blocks of W_t. So when we add components, we may need to customize the variances, as in the sketch below.
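To make the block structure concrete, here is a minimal NumPy sketch of superposition: a hypothetical local linear trend plus one seasonal harmonic. All variance values are made up for illustration, and (if I recall correctly) this mirrors what the R `dlm` package builds when you add model objects with `+`.

```python
import numpy as np
from scipy.linalg import block_diag

# Component 1: local linear trend (order-2 polynomial), state = (level, slope)
F_trend = np.array([1.0, 0.0])
G_trend = np.array([[1.0, 1.0],
                    [0.0, 1.0]])
W_trend = 0.01 * np.eye(2)              # evolution noise for this block (made up)

# Component 2: one Fourier harmonic with period 12
w = 2.0 * np.pi / 12.0
F_seas = np.array([1.0, 0.0])
G_seas = np.array([[ np.cos(w), np.sin(w)],
                   [-np.sin(w), np.cos(w)]])
W_seas = 0.001 * np.eye(2)              # evolution noise for this block (made up)

# Superposition: stack the F's, block-diagonalize the G's and W's,
# and keep a single observational variance V for the combined model.
F = np.concatenate([F_trend, F_seas])   # shape (4,)
G = block_diag(G_trend, G_seas)         # shape (4, 4)
W = block_diag(W_trend, W_seas)         # component-specific blocks of W_t
V = 0.5                                 # one V_t for the sum of components (made up)
```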
A second tricky bit is the model's dimensions and how many of them are free variables. Filtering gives us estimates of the parameters we collect in \theta, but we will also need to infer the two variances.
Also in the (Prado, Ferreira, and West 2023) book the authors point out that we will also want to infer some parameters of F and G. Giving it some thought I think they are:
- the degree for the polynomial trend components
- the number of harmonics for the seasonal component
- the p and q parameters for the ARMA components
- which regressors and interaction terms to include in the regression component
In other words we may well be interested in doing additional inference for the model selection.
103.1.2 What Books are there on NDLMs?
The first two books are hard to read but they also contain a slew of exercises. Like most mathematical textbooks, you won't get more than 25% of the material unless you do enough of these exercises. I doubt you'll even memorize the recursion equations and their interpretations unless you do. P.S. unless you are in the middle of your PhD, doing the exercises is very taxing. There are no solutions, and even reading the derivation in my own solution was taxing. With enough detail I could be fairly sure I had a decent answer, but it could take a couple of hours or a couple of days to get there. I feel that to a large extent I would learn more making models than derivations. It is clear that handling missing data, structural change points, or interventions requires great facility with Kalman filtering, as well as with setting up the data and model so that the DLM package can do its magic. This is well beyond the level of the course.
That said, much of the material seems rather theoretical, and I want to use NDLMs in some sophisticated models. In hindsight I believe this material isn't as complicated as it first appears to the student. Other books on the subject, from econometrics, ecology, and other applied areas, are likely far more accessible. There is also more software in the wild that can facilitate things like finding structural changepoints.
(West and Harrison 2013) lays down the theory but is long-winded. It often meanders rather than giving a clear and concise explanation. Perhaps the authors assume the readers will read it several times during their PhD and a few more times for their postdoc. There are many gaps in this book.
(Prado, Ferreira, and West 2023) is much longer, more recent, far more advanced, and covers many more models. It tends to reference many research papers, is poorly motivated, and suffers from an even greater tendency to meander rather than provide the deep insights its expert authors clearly possess. The book has many appendices that feel more like handouts on unrelated topics, with one or two results that we might have used.
(Petris, Petrone, and Campagnoli 2009) is the most accessible of the trio. It is part of the excellent Use R! series, which I skimmed through years and years ago. The book takes a hands-on approach to using the DLM library in R. It is a text that fills in some of the gaps in the two texts above. However, it isn't so easy to pick up NDLMs without a background in Bayesian statistics and time series.
The following titles are books I have not read but which I noticed during my research.
(Durbin and Koopman 2012) which is a classic. It is a very mathematical book that is hard to read but it does have a lot of exercises. It is a great book for those who want to understand the mathematical foundations of the Kalman filter and its applications in time series analysis.
(Harvey 1990) is an excellent book that covers the Kalman filter and its applications in time series analysis. It is more accessible than Durbin and Koopman, but still quite mathematical.
103.1.3 What is a Normal Dynamic Linear Model (NDLM)?
A Normal Dynamic Linear Model (NDLM), often simply called a Dynamic Linear Model (DLM) when normality is understood, is a class of dynamic models commonly assumed to have normal (Gaussian) distributions. It is characterized for each time t
by a set of quadruples \{\mathbf{F}_t, \mathbf{G}_t, V_t, \mathbf{W}_t\}.
The core of an NDLM is defined by two sequential equations:
- Observation Equation: Y_t = \mathbf{F}_t^\top \theta_t + \nu_t, where Y_t is the observation vector, \theta_t is the parameter (or state) vector, \mathbf{F}_t is a known design matrix, and \nu_t is an observational noise term assumed to be normally distributed with zero mean and known variance matrix V_t (\nu_t \sim \mathcal{N}[0, V_t]).
- System Equation: \theta_t = \mathbf{G}_t \theta_{t-1} + \omega_t, where \mathbf{G}_t is a known system evolution matrix and \omega_t is a system noise term assumed to be normally distributed with zero mean and known variance matrix W_t (\omega_t \sim \mathcal{N}[0, W_t]).
The error sequences (\nu_t and \omega_t) are assumed to be internally independent, mutually independent, and independent of the initial information. NDLMs are widely used for modeling time series, capturing how processes change over time. Their flexibility and generality allow them to handle complex problems and directly quantify uncertainty, enabling the fitting of models with many parameters and intricate probability specifications.
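To make the quadruple concrete, here is a minimal simulation sketch (in Python/NumPy rather than R, and with made-up numbers) of a constant, one-dimensional NDLM \{F, G, V, W\} — the local level model with F = 1 and G = 1 — drawing states from the system equation and observations from the observation equation.

```python
import numpy as np

rng = np.random.default_rng(42)

# Constant quadruple {F, G, V, W} for a local level model (values are assumptions)
F = np.array([1.0])
G = np.array([[1.0]])
V, W = 1.0, 0.1
T = 100

theta = np.zeros((T, 1))
y = np.zeros(T)
theta_prev = np.array([0.0])            # initial state

for t in range(T):
    # System equation: theta_t = G theta_{t-1} + omega_t,  omega_t ~ N(0, W)
    theta[t] = G @ theta_prev + rng.normal(0.0, np.sqrt(W), size=1)
    # Observation equation: y_t = F' theta_t + nu_t,        nu_t ~ N(0, V)
    y[t] = F @ theta[t] + rng.normal(0.0, np.sqrt(V))
    theta_prev = theta[t]
```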
103.1.4 What are these moments we keep hearing about?
Let's take a moment to unpack the "moments" reference.
When we talk about filtering we have two or three equations called the filtering equations. These have a recursive form and are often either conditionally or directly Normal or Student-t. The first and second moments of these distributions are fed into the next update equations. They have names and interpretations, which I might cover in another question. So that is the main usage of moments, but we also have priors and other distributions, and they too have moments; the professor might be talking about those as well. Since the priors are the starting point of the recursive formulas, though, what I explained initially is a good place to build intuition.
\begin{aligned} \color{RoyalBlue}{a_t} &= G_t m_{t-1} & \text{state prior mean}\\ \color{RoyalBlue}{R_t} &= G_t C_{t-1} G_t' + W_t & \text{state prior var}\\[2pt] \color{Magenta}{f_t} &= F_t' a_t & \text{1-step forecast mean}\\ \color{Magenta}{Q_t} &= F_t' R_t F_t + V_t & \text{1-step forecast var}\\[2pt] \color{BrickRed}{A_t} &= R_t F_t / Q_t & \text{Kalman gain}\\ \color{ForestGreen}{m_t} &= a_t + A_t (y_t - f_t) & \text{state post mean}\\ \color{ForestGreen}{C_t} &= R_t - A_t A_t' Q_t & \text{state post var} \end{aligned} \tag{103.1}
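As a minimal sketch (univariate y_t, known V and W; the variable names mirror the moments in Equation 103.1, and this is my own illustration rather than the course's R code), the recursions translate almost line for line into NumPy:

```python
import numpy as np

def kalman_filter(y, F, G, V, W, m0, C0):
    """Forward filtering pass implementing the recursions of Eq. (103.1)."""
    T, p = len(y), len(m0)
    m, C = m0.copy(), C0.copy()
    ms = np.zeros((T, p)); Cs = np.zeros((T, p, p))
    fs = np.zeros(T); Qs = np.zeros(T)
    for t in range(T):
        a = G @ m                          # state prior mean
        R = G @ C @ G.T + W                # state prior variance
        f = F @ a                          # 1-step forecast mean
        Q = F @ R @ F + V                  # 1-step forecast variance (scalar)
        A = R @ F / Q                      # Kalman gain
        m = a + A * (y[t] - f)             # state posterior mean
        C = R - np.outer(A, A) * Q         # state posterior variance
        ms[t], Cs[t], fs[t], Qs[t] = m, C, f, Q
    return ms, Cs, fs, Qs
```

For instance, calling kalman_filter(y, F, G, V, W, np.zeros(1), np.eye(1)) on a univariate series returns the filtered means m_t together with the one-step forecast moments f_t and Q_t that reappear in the diagnostics questions below.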
103.1.5 Can you explain Inference in the NDLM?
We covered four cases in the notes, yet it is easy to miss the big picture.
We saw several derivations of filtering, etc., with different settings for v_t and \mathbf{W}_t:
1. v_t and \mathbf{W}_t both known (Normal conjugate structure).
2. v_t=v constant but unknown, with \mathbf{W}_t known up to scale (variances and covariances scaled by v; \mathcal{IG} prior for v, Student-t forecast and state distributions).
3. v_t=v known and \mathbf{W}_t unknown/changing, set via a discount factor \delta.
4. v_t=v unknown and \mathbf{W}_t unknown/changing, set via a discount factor \delta.
As far as I can tell from the start of (Prado, Ferreira, and West 2023, sec. 4.3), these are strong simplifying assumptions on the road to a more general case with v_t=v unknown and \mathbf{W}_t unknown. As best as I can tell, for our time series analysis the input data is the same in all cases…
Can we also do away with the assumption of having constant observational variance? Isn’t it a strong assumption for us to make?
103.1.6 Can we use Bayesian methods to infer the F and G of an NDLM?
It is too easy to get hung up on the word Bayesian here. I mean, it would be neat if there were an algorithm that figured out F and G from the data. It would be even nicer if there were an algorithm that could recover the dynamics from the data. (Phase-space reconstruction algorithms are possible for chaotic systems, where trajectories are dense in the phase space, but if the trajectory isn't chaotic we are out of luck.)
Anyhow, our job as modelers is to set up the model and make the various assumptions. For a Kalman filter we need system dynamics. Using a DLM entails describing these via the superposition of a trend, periodicity, ARMA(p,q), and a time-indexed regression. This is our choice of inductive bias for the model, and it is subjective: it reflects our view of the underlying process we are trying to model.
To sum up:
It is your job as a good Bayesian to make your assumptions explicit and to be aware of their implications. If you specify the model well, you may imagine the busts of two exponents of subjective probability, de Finetti and Ramsey, nodding at you in approval from their pedestals. And if you make poor choices, you might hear the bust of Rudolf E. Kálmán having a fit as his filter chokes on your data.
103.1.7 Isn’t the NDLM over/under specified?
If \{\mathbf{F}_t, \mathbf{G}_t, V_t, \mathbf{W}_t\} changes at every time point t, i.e. some or all of the components of the model change, we are likely to have overfitting or underfitting problems (too many/few parameters compared to the data). The Kalman filter performs filtering and smoothing, and while these operations have optimality guarantees, they stabilize only if they are permitted to converge to some limit (i.e. enough steps, where "enough" may depend on V_t and \mathbf{W}_t).
The NDLM framework is very flexible, generalizing many known time series models in a single framework. In practice, changes in the model at some index t should be there to:
- handle NAs,
- handle interventions, or
- handle structural change points (these are covered in Section 103.1.23).
Missing y_t:
- Skip measurement update at t: keep (m_t,C_t)=(a_t,R_t); continue with t\!+\!1
Interventions (level shift, temporary shock, ramp):
- Add a regressor to F_t with known design (step, pulse, ramp).
- Give its state component a small \delta (fast adaptation) or a spike-and-slab prior.
Structural break:
- Temporarily reduce \delta (or inflate W_t) for the block governing level/slope.
- Optionally reinitialize (m_t,C_t) for that block.
- For recurring regimes, consider switching DLMs / Markov-switching LDS.
c.f. (Durbin and Koopman 2012, sec. 11.5) (West and Harrison 2013, Ch.11)
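A hedged sketch of the missing-data rule above, written as a branch inside a single filtering step of Eq. (103.1); the function name and shapes are mine, not the course's.

```python
import numpy as np

def filter_step(y_t, m, C, F, G, V, W):
    """One filtering step of Eq. (103.1); y_t may be np.nan (missing)."""
    a = G @ m
    R = G @ C @ G.T + W
    if np.isnan(y_t):
        # Missing y_t: skip the measurement update, keep (m_t, C_t) = (a_t, R_t)
        return a, R
    f = F @ a
    Q = F @ R @ F + V
    A = R @ F / Q
    return a + A * (y_t - f), R - np.outer(A, A) * Q

# For a known intervention at time t0, one would instead extend F_t with a
# step/pulse regressor, or temporarily inflate W (smaller delta) around t0.
```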
103.1.8 How much data do I need to fit an NDLM with k dimensions?
The amount of data required to fit an NDLM with k dimensions depends on several factors, including the complexity of the model, the number of parameters to be estimated, and the desired level of precision in the estimates. In general, more data is needed for:
- Higher dimensionality: As the number of dimensions (k) increases, the parameter space becomes larger, requiring more data to obtain reliable estimates.
- Model complexity: More complex models with intricate structures (e.g., multiple state variables, non-linear relationships) typically require more data to accurately capture the underlying dynamics.
- Desired precision: If high precision is needed in the parameter estimates or predictions, more data will be necessary to reduce uncertainty.
A common rule of thumb is to have at least 10-20 observations per parameter to be estimated. However, this is not a good rule for DLMs. Three full seasons might be more appropriate, and the actual data requirements may vary based on the specific context and goals of the analysis. One recommendation is tuning \delta by the one-step predictive log score and checking the standardized forecast errors. (West and Harrison 2013, Ch. 6)
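A minimal sketch of that last recommendation: grid-searching a single discount factor \delta by the one-step-ahead predictive log score, using the discount form R_t = G_t C_{t-1} G_t'/\delta that appears later in this FAQ. The data and starting moments here are placeholders.

```python
import numpy as np
from scipy.stats import norm

def one_step_log_score(y, F, G, V, delta, m0, C0):
    """Sum of log predictive densities N(f_t, Q_t) under a single discount factor delta."""
    m, C, score = m0.copy(), C0.copy(), 0.0
    for y_t in y:
        a = G @ m
        R = (G @ C @ G.T) / delta              # discounted state prior variance
        f = F @ a
        Q = F @ R @ F + V
        score += norm.logpdf(y_t, loc=f, scale=np.sqrt(Q))
        A = R @ F / Q
        m, C = a + A * (y_t - f), R - np.outer(A, A) * Q
    return score

# Hypothetical usage: pick the delta with the highest one-step predictive log score.
# deltas = np.arange(0.90, 1.001, 0.01)
# best = max(deltas, key=lambda d: one_step_log_score(y, F, G, V, d, m0, C0))
```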
103.1.9 How are filtering, smoothing, and forecasting performed in NDLMs?
NDLMs provide a coherent framework for these key time series analyses:
- Filtering: This process estimates the current state of the system (\theta_t) based on all available observations up to the current time t. In NDLMs, this is typically achieved using Kalman filter recurrences. The posterior mean of the state is a weighted average of the prior mean and the current observation, with weights proportional to their precisions.
- Smoothing (Retrospective Analysis): This involves estimating past states (\theta_s for s < t) by incorporating all available data, including future observations up to a fixed interval T. It provides a retrospective view of the parameter values based on the entire dataset. Conditional independence results are crucial for developing efficient smoothing algorithms.
- Forecasting: This involves predicting the future behavior of the system state and observations (Y_{t+k}, \theta_{t+k}) for k steps ahead, given data up to the current time t. Forecast functions define the qualitative form and expected numerical development of the time series.
These processes provide linear posterior means and variances, which is a justification for their use even outside strict normality assumptions.
103.1.10 What are the computational challenges and methods used for NDLMs with unknown parameters?
When NDLM parameters, especially variances, are unknown and time-varying, obtaining exact analytical solutions for the full posterior distributions becomes complex. This necessitates the use of various computational techniques:
- Approximation Techniques:
- Normal Approximation: Often applied to posterior distributions, especially for parameters that are difficult to model directly.
- Linearization: For models with non-linear components, linear approximations can be used to transform them into (approximate) DLMs, allowing for standard DLM analysis.
- Simulation-Based Methods (Markov Chain Monte Carlo - MCMC):
- General MCMC: These methods are widely used for posterior inference in complex dynamic models.
- Gibbs Sampling: A common MCMC approach, it is often easily implemented for sampling posterior distributions of model parameters and state vectors within a fixed time interval, especially for conditionally linear/normal models.
- Forward Filtering, Backward Sampling (FFBS): A specific and efficient MCMC algorithm introduced for sampling the full set of state vectors from the posterior distribution in conditionally Gaussian DLMs. It exploits the Markovian structure of the model.
- Particle Filters (Sequential Monte Carlo - SMC): These are crucial for “on-line” or recursive inference, where new observations frequently arrive and re-running an entire MCMC every time is computationally inefficient. Particle filters are particularly well-suited for non-linear and non-Gaussian state-space models. Rao-Blackwellized particle filters can improve efficiency in conditionally Gaussian contexts.
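To make the FFBS step above less abstract, here is a minimal sketch of the backward-sampling pass for a conditionally Gaussian DLM, assuming a forward filtering pass has already stored the moments (a_t, R_t, m_t, C_t) from Equation 103.1 — a generic illustration, not the course's implementation.

```python
import numpy as np

def backward_sample(ms, Cs, as_, Rs, G, rng):
    """Draw one joint sample of theta_{1:T} from its posterior (FFBS backward pass)."""
    T, p = ms.shape
    theta = np.zeros((T, p))
    # Start from the filtering distribution at the final time point
    theta[-1] = rng.multivariate_normal(ms[-1], Cs[-1])
    for t in range(T - 2, -1, -1):
        B = Cs[t] @ G.T @ np.linalg.inv(Rs[t + 1])
        h = ms[t] + B @ (theta[t + 1] - as_[t + 1])
        H = Cs[t] - B @ Rs[t + 1] @ B.T
        H = (H + H.T) / 2.0                 # symmetrize for numerical stability
        theta[t] = rng.multivariate_normal(h, H)
    return theta
```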
103.1.11 How are NDLMs specified, designed?
- Model Specification and Design: DLMs are often constructed by superposition (combining) two or more component DLMs, each capturing a specific feature like trend, seasonality, or regression. The starting point for model design is typically the desired forecast function, which determines the qualitative and quantitative form of the time series development.
- Hierarchical Models: These are powerful extensions for problems involving multiple, related parameters. Data are modeled conditionally on parameters, which themselves are given a probabilistic specification in terms of hyperparameters. This structure allows “borrowing strength” across related groups or units, enhancing inference.
103.1.12 How are NDLMs checked for adequacy?
- Model Checking (Diagnostics): Essential for assessing how well the model fits the data and substantive knowledge:
- Posterior Predictive Checks: Involve simulating replicated datasets from the model's posterior predictive distribution and comparing them to the observed data. For NDLMs, this often includes examining the standardized forecast errors (e_t/\sqrt{Q_t}), which should resemble Gaussian white noise if the model is adequate. Graphical tools like QQ-plots and empirical autocorrelation functions are used to assess normality and uncorrelatedness.
- Sensitivity Analysis: Involves recomputing posterior inferences under plausible alternative models to evaluate the robustness of conclusions to modeling assumptions.
- Model Comparison: Competing models can be evaluated based on measures like predictive accuracy (e.g., log score), information criteria (e.g., AIC, DIC, WAIC), or Bayes factors.
Checks on the standardized forecast errors r_t = e_t/\sqrt{Q_t} (a small code sketch follows this list):
- White noise: ACF/PACF of r_t; Ljung–Box on r_t (not raw residuals).
- Normality: QQ-plot of r_t; heavy tails ⇒ variance discounting or robust component.
- Calibration: coverage of 1-step predictive intervals; PIT histogram.
- Predictive score: rolling log score or CRPS for model comparison.
- Breaks: spikes/variance jumps in r_t ⇒ local \delta\downarrow or add intervention regressor.
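A small sketch of the white-noise and calibration checks from this list (the Ljung–Box test, QQ-plots, and PIT histograms are left to your preferred stats library); fs and Qs here are the one-step forecast moments from a filtering pass.

```python
import numpy as np
from scipy.stats import norm

def residual_checks(y, fs, Qs, level=0.95):
    """Standardized 1-step forecast errors, lag-1 autocorrelation, and interval coverage."""
    r = (y - fs) / np.sqrt(Qs)               # r_t = e_t / sqrt(Q_t)
    acf1 = np.corrcoef(r[:-1], r[1:])[0, 1]  # should be near 0 for white noise
    z = norm.ppf(0.5 + level / 2.0)          # e.g. 1.96 for 95% intervals
    coverage = np.mean(np.abs(r) <= z)       # should be near `level` if well calibrated
    return r, acf1, coverage
```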
103.1.13 What is the difference between DLM and NDLM?
DLMs are the subject of (West and Harrison 2013), though most of the time they are actually discussing NDLMs, which are a special case of DLMs in which the priors and the error terms are Normally distributed. We use "NDLM" to emphasize this normality assumption, which simplifies the analysis, letting us replace Bayes' rule in derivations with powerful results from Normal theory and allowing the use of conjugate priors, making the Bayesian updating process more straightforward. DLMs work with any member of the exponential family, but these models are not as well explored as the Normal ones.
103.1.14 Why are \mathbf{F}_t and \mathbf{G}_t a vector and a matrix respectively?
It may help to think about \mathbf{F} and \mathbf{G} as follows:
If we start with \mathbf{G}_t, we see it is a linear transformation that describes how the state vector evolves over time. I like to think of it as a Hidden Markov state transition matrix.
Once we have the updated state, \mathbf{F}_t^\top acts as a linear transformation that maps the latent state \vec{\theta}_t into the observation space of y, while \nu_t injects some observation noise.
In the state evolution equation \theta_t = G_t\theta_{t-1}+\omega_t we pre-multiply \theta_{t-1} by \mathbf{G}_t to deterministically update the state, and we then add \omega_t to account for process noise.
In other words, \mathbf{F}_t takes the current hidden state \theta_t and produces an observation y_t, while \mathbf{G}_t takes the current state and produces the next state.
103.1.15 Why is a DLM called a linear model?
This is because the observation equation is a linear equation relating the observations to the parameters in the model, and the system equation is a linear equation telling us how the time-varying parameters change over time. This is why we call it a linear model.
103.1.16 Why are the noise terms \nu_t and \omega_t assumed to be normally distributed?
This is a common assumption in time series analysis. It is a convenient assumption that allows us to perform Bayesian inference and forecasting in a very simple way. And this is why we call this a normal dynamic linear model.
103.1.17 Isn’t this just a hierarchical model?
It is a hierarchical model, but not just that. First, the observation and system evolution equations are also auto-recursive, giving them a temporal structure. We have a model for the observations and a model for the system level. The system level changes over time, and the observations are related to the system level through the observation equation. As explained above, G is a matrix, i.e. a set of simultaneous equations, and these may capture hierarchical, multilevel, or other structures.
We saw in the development of the p order polynomial trend model that we can add p levels to the evolution equation. And so it is possible to extend this model to more complex structures if we wish to do so by adding another level, etc…
However, as we add more levels they must be written in a representation that the Kalman filter algorithm can process.
This means we take all these levels and fold them into G, keeping the temporal structure of the two-level overall framework!
One more thought on structure: we can combine different DLMs into a bigger one using superposition (stacking). This isn't something we considered before for hierarchical models, so again, not just a hierarchical model.
103.1.18 What is the difference between NDLMs and AR(p)/ARIMA models?
NDLMs are built of components, and one of the components can be an AR(p) model. AR(p) models need to be stationary, but NDLMs have no such requirement. Note that what I said above regarding AR(p) also applies to an ARMA component, which is a more general model than AR(p); a concrete AR(2)-as-a-DLM-block sketch follows below.
- Intuitively, the NDLM can have seasonal and trend components, and if these account for the non-stationary part of the series, then the AR(p) component might account for the stationary residual.
TODO: it is unclear that this actually happens, or that there are guarantees that the algorithms will do this.
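For concreteness, here is one common state-space (companion-form) representation of an AR(2) component as a DLM block — a sketch with made-up coefficients; other equivalent representations exist.

```python
import numpy as np

# AR(2) process x_t = phi1 * x_{t-1} + phi2 * x_{t-2} + e_t written as a DLM block
phi1, phi2 = 0.6, 0.3                 # assumed (stationary) AR coefficients
F_ar = np.array([1.0, 0.0])           # observe the first state element
G_ar = np.array([[phi1, phi2],
                 [1.0,  0.0]])        # companion-form evolution matrix
W_ar = np.diag([1.0, 0.0])            # the innovation enters only the first element

# This block can be superposed with trend and seasonal blocks exactly as sketched
# earlier, so the AR(2) part models the (stationary) residual dynamics.
```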
103.1.19 What are moments for NDLM?
The instructor and the book refer to parts of the NDLM model as "moments"; what is that about? We just said in Q1 that NDLMs posit a normal structure on the priors and errors. When they talk about the moments they mean the means and variances of these distributions. These are quantities of interest. The priors are obviously known; the errors, in general, are not.
To make things a bit clearer, the Kalman filter is optimal in some sense at estimating the state of the system at time t given the data up to time t. The state is what we call \theta_t. What the Kalman filter cannot eliminate is the impact of the variances at the system and observation levels; there is always an error. However, there are theoretical guarantees that the Kalman filter will provide the best linear unbiased estimate (BLUE) of the state, so long as sufficient data has been seen.
The moments are the inputs and outputs of the filter: we propagate posterior means and variances, and m_t is a posterior mean, not an MLE. But we will also get a posterior for the variances at the system and observation levels. These are the other moments of the model, and it is here that we actually need to make strategic decisions. We can set a complex prior and get many measurements to obtain a good posterior for the variance, or, more likely, we don't really know much about the errors and prefer to postulate something as simple as possible and then use the data to get a posterior for the variance. (Simple here means a model that only requires computing and interpreting the noise at the observation level, i.e. the difference between the model's forecast and the actual observation we see next.)
103.1.20 What is this thing called \delta, and can I ignore it?
In an NDLM where we don't know the system variance, we can replace it, under a simplifying assumption, by decomposing R_t in the filtering equations. I think of this as providing us with a "surrogate" model, i.e. a simpler model that approximates our original model.
What we do is decompose the system evolution into a deterministic part and a stochastic part, and update the covariance using a reduced form. This means we have a raw estimate of the covariance based on how the previous time's posterior evolves according to G, i.e. G C_{t-1} G^\top, and we use \delta as a weight to set how much of that term we want to pass through.
So to sum up: we can use the discount factor hyperparameter, denoted \delta, which we learned to estimate by optimizing a loss (e.g. the MSE of the one-step forecast errors) over \delta. This question is an informal outline of the section on specifying the system covariance matrix via discount factors.
103.1.21 What is the Kalman gain?
The Kalman gain is a key component of the Kalman filter, which is used in NDLMs to update the state estimates based on new observations. It determines how much weight to give to the new observation relative to the current state estimate.
103.1.22 How does the Kalman filter feature in DLMs?
The Kalman filter is the iterative algorithm driving NDLMs. It is used for estimating the hidden states of the model and updating these estimates as new observations become available. Kalman filters require four matrices to do their magic, and the NDLM code handles putting everything into a form that is compatible with the Kalman filter.
Unfortunately, the Kalman filter has a tendency to amplify noise if V_t is underestimated, which can be problematic in practice. This means that if the model is not well specified, or if the noise characteristics change over time, the Kalman filter may produce unreliable estimates.
103.1.23 What is a structural change point?
This is a point in the time series where the underlying data generating process changes in a way that cannot be captured by the existing model structure. This can happen due to various reasons such as:
- changes in external conditions (e.g., economic shifts, policy changes)
- changes in the behavior of the system being modeled (e.g., a sudden change in consumer behavior)
- introduction of new variables or factors that were not previously considered.
As far as modeling the system, we need to revise our parameters, or if the change to the process is more radical we may need to incorporate additional parameters, by adding a component to the DLM.
There is a lot of criticism of the Facebook Prophet algorithm breaking at some point. Digging deeper, this is often due to distributional drift or, more likely, a structural change point. FB Prophet is
- much simpler than NDLM and
- uses Stan for MCMC and doesn’t use KF for its inference and
- its regular users lack the ability to modify its internal components.
So it is a huge problem to fix FB Prophet if it blows up in production while being used in a recommendation system; retaining FB Prophet may not work.
In contrast, DLM theory emphasizes that DLMs are open, at any time point, to changes at all levels (without worrying the student about how, in practice, they are supposed to do this, or how robust the model is to such changes). After all, the Kalman filter is fantastic at using feedback to give optimal updates. But it is probably just as hard to handle a structural change, an intervention, or an NA. I think that the extra code we got in class, which takes or creates lists of matrices indexed in time, is exactly an NDLM in short notation that can handle this. It then becomes a matter of practice: inserting NAs, inserting interventions, and adding or changing components from a certain point.
The basic structure of the DLM, however, does allow us to handle structural change points simply by updating the system evolution matrix G_t and the observation matrix F_t whenever we need to. The challenge is that:
- We need to detect the change point in the underlying process. (Using data drift detection algorithms)
- We need to know how to incorporate this change to the model. (This is a modeling challenge that requires a case by case analysis and usually can only be done in retrospect).
In the past few decades there has been a lot of interest, for example in econometrics, in models that handle regime changes in volatility; this has spawned a whole research area and a family of models starting with GARCH (EGARCH, TGARCH, CGARCH, FIGARCH, HYGARCH, etc.).
If we realize ahead of time that our DLM may tend to switch between different regimes we can actually incorporate this into our model. There are state space models that incorporate identifying and switching between different models. These are called Markov switching models (MSMs) and they are a generalization of NDLMs. MSMs can be used to model time series with structural change points, but they are more complex and require more data to estimate the parameters. These models are covered briefly in (Davidson-Pilon 2015) and in full detail in (Frühwirth-Schnatter 2006).
103.1.24 What does Polynomial mean in a Polynomial trend DLM?
I was confused about this and there are three good reasons to be!
- The AR(p) model has a characteristic polynomial, which has nothing to do with the polynomial in these trend models. AR(p) components of DLMs create jagged forms in the forecast function, while polynomial trend models are popular because they add a smooth trend. The polynomial model is a sub-model that covers the trend; this allows us to model the residual as an AR(p) even if the data is non-stationary.
- In (West and Harrison 2013) the authors talk about a Taylor series approximation before delving into the first polynomial trend model. This is an intuitive way to think about the model: a polynomial for which we can pick order one or two to get better approximations, lingo common in physics and numerical methods. However, it is completely unrelated to the polynomial that gives the model its name.
For a polynomial model of order p, when we multiply out the terms of the forecast function we get a polynomial of order p-1.
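A small numeric check of that claim for p = 2 (the linear growth model): the forecast function f_t(h) = F' G^h m_t is a degree-1 polynomial in h. The filtered mean m_t here is made up.

```python
import numpy as np

# Order-2 polynomial trend (linear growth) model
F = np.array([1.0, 0.0])
G = np.array([[1.0, 1.0],
              [0.0, 1.0]])
m_t = np.array([10.0, 0.5])            # hypothetical filtered mean: level 10, slope 0.5

# f_t(h) = F' G^h m_t = level + h * slope, i.e. a polynomial of degree p - 1 = 1 in h
for h in range(5):
    print(h, F @ np.linalg.matrix_power(G, h) @ m_t)   # 10.0, 10.5, 11.0, 11.5, 12.0
```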
103.1.25 Can NDLMs handle unknown and non-constant observational (V_t) and system (W_t) variances?
NDLMs can be extended to handle cases where both observational variance (V_t) and system variance (W_t) are unknown and vary over time. This moves beyond the simplest Kalman filter assumptions of known variances.
Here’s how this is typically managed:
- Time-Varying System Variance (W_t):
- The most common and practical approach is through discount factors (\delta). A discount factor defines \mathbf{W}_t as a proportion of the previous time step's prior covariance, effectively R_t=\delta^{-1}G_tC_{t-1}G_t',\qquad W_t=\frac{1-\delta}{\delta}\,G_tC_{t-1}G_t'. This allows \mathbf{W}_t to be automatically time-varying and adaptive, simplifying the specification of complex covariance elements to a single scalar. Different discount factors can be applied to different components of the state vector.
- Time-Varying Observational Variance (V_t):
- This is handled through variance discounting or discounted variance learning. This technique models a decay of information about the observational precision (\phi_t = \frac{1}{V_t}) over time, maintaining the conjugate Gamma distribution form for precision. The prior for \phi_t at time t is derived by discounting the degrees of freedom and scale parameter from the previous posterior (e.g., G[\delta n_{t-1}/2, \delta d_{t-1}/2] ). This makes the variance estimate more adaptive to recent data. The concept of “power-discounting” is also mentioned in relation to modifying the prior distribution, suggesting a general method for flattening distributions, which can be applied to precision parameters.
- Multivariate Extensions: These discounting approaches extend to multivariate DLMs. For instance, matrix normal/Inverse Wishart distributions can be used to handle time-varying observational covariance matrices (\Sigma_t), often with dynamics defined by a matrix beta evolution model.
For scalar \delta\in(0,1] applied to the state evolution: \begin{aligned} R_t &= \delta^{-1}\,G_t C_{t-1} G_t' \\ W_t &= R_t - G_t C_{t-1} G_t' \;=\; \tfrac{1-\delta}{\delta}\,G_t C_{t-1} G_t'. \end{aligned}
Usage.
- Smaller \delta ⇒ larger W_t ⇒ faster adaptation.
- Use block discounts (different \delta) per component of \theta_t.
- For breaks/interventions: temporarily set \delta\!\ll\!1 on affected blocks.
(West and Harrison 2013, Ch.6) (Prado, Ferreira, and West 2023, sec. 4) (Durbin and Koopman 2012, sec. 2)
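A minimal sketch of block (component-wise) discounting following the formulas above; the block boundaries and \delta values are assumptions for illustration.

```python
import numpy as np

def discounted_prior(G, C_prev, blocks, deltas):
    """R_t with per-block discounting of P_t = G C_{t-1} G', plus the implied W_t."""
    P = G @ C_prev @ G.T
    R = P.copy()
    for (i, j), delta in zip(blocks, deltas):
        R[i:j, i:j] = P[i:j, i:j] / delta   # smaller delta => larger block of W_t
    W = R - P                               # implied evolution variance W_t = R_t - P_t
    return R, W

# Hypothetical usage: trend block (states 0-1) with delta=0.98,
# seasonal block (states 2-3) with delta=0.95.
# R, W = discounted_prior(G, C_prev, blocks=[(0, 2), (2, 4)], deltas=[0.98, 0.95])
```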
Discount the prior for the precision \phi_t=V_t^{-1} to keep \mathbb E[\phi_t] fixed while inflating uncertainty.
If \phi_{t-1}\sim\mathrm{Gamma}(a_{t-1}, b_{t-1}) (shape–rate), set
\phi_t\mid\mathcal D_{t-1}\sim\mathrm{Gamma}(\beta a_{t-1},\; \beta b_{t-1}), \quad \beta\in(0,1].
Then \mathbb E[\phi_t]=a_{t-1}/b_{t-1} (unchanged) and
\mathrm{Var}(\phi_t) = (1/\beta)\,a_{t-1}/b_{t-1}^2 (inflated).
Effect. Forecasts remain Student-t; recent data get more weight.
Refs: (West and Harrison 2013, sec. 10.8) (Prado, Ferreira, and West 2023, sec. 4)
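A tiny numeric check of the statement above, with made-up Gamma (shape–rate) parameters: discounting both parameters by \beta leaves the mean of \phi_t unchanged and divides its variance by \beta.

```python
a, b, beta = 20.0, 10.0, 0.95   # hypothetical Gamma(shape, rate) parameters and discount
a_d, b_d = beta * a, beta * b   # discounted prior parameters for phi_t

print(a / b, a_d / b_d)         # means are equal: 2.0 and 2.0
print(a / b**2, a_d / b_d**2)   # variance is inflated: 0.2 vs 0.2 / beta ≈ 0.2105
```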
103.1.26 What are some common extensions and generalizations of NDLMs?
The DLM framework is highly flexible and can be extended in various ways:
- Non-Normal and Non-Linear Dynamic Models:
- Dynamic Generalized Linear Models (DGLMs): Extend DLMs by using exponential family distributions (e.g., Poisson for count data, Binomial for proportions) for the observational model, often involving non-linear link functions.
- General Non-Linear Dynamic Models: Arise when parameters (e.g., \lambda in a transfer response function, or a discount factor itself) introduce non-linearities into the system or observation equations.
- Stochastic Volatility (SV) Models: Often formulated as non-linear/non-Gaussian state-space models where volatility parameters evolve dynamically, requiring specialized computational methods.
- Mixture Models: Can be incorporated to handle non-normal error distributions or to model phenomena like occasional outliers.
- Multivariate and Matrix Normal DLMs:
- Multivariate DLMs: Generalize to handle vector-valued observations, allowing for joint modeling of multiple time series.
- Matrix Normal DLMs: Provide a framework for multivariate time series analysis where the covariance structure across series is unknown, leveraging matrix-variate normal distributions for fully conjugate analyses.
- Dynamic Graphical Models: Combine matrix-variate DLMs with Gaussian graphical models, allowing for structured and often sparse precision matrices, which is useful for scalability in high-dimensional time series.
- Dynamic Dependence Network Models (DDNMs): These models define multivariate dynamic models by coupling customized univariate DLMs, extending time-varying vector autoregressive (TV-VAR) models and allowing for flexible modeling of time-varying parameters and volatilities.
- Spatio-Temporal Models: NDLMs form the foundation for dynamic spatio-temporal models (DSTMs), which model processes that vary across both space and time. These models can also incorporate non-linearity and non-Gaussian elements. Hidden Resolution Models (HRMs) are a type of multiscale time series model that can be formulated as DLMs.
103.1.27 The Normal Dynamic Linear Model: Definition, Model classes & The Superposition Principle
Dynamic Linear Models (DLMs) extend classical linear regression to time-indexed data, introducing dependencies between observations through latent evolving parameters. A Normal DLM (NDLM) assumes Gaussian noise at both observation and system levels, enabling tractable Bayesian inference through the Kalman filter.
While superficially complex, NDLMs are conceptually close to linear regression. Instead of I.I.D. observations indexed by i, we index data by time t and allow parameters to evolve with time, resulting in a two-level hierarchical model. At the top level is the observation equation. Below this there is the evolution equation(s) that can be understood as a latent state transition model that can capture trends, periodicity, and regression. The evolution equations can have more than one level however we will see that with some work these are summarized into a matrix form.
To make things simpler, this is demonstrated using a white noise process and then a random walk model. What makes the NDLM somewhat different is that there are two variance elements at two levels, necessitating learning more parameters. Once we cover these two models, the instructor walks us through all the bits and pieces of the notation. Later we will see that we can add trend, periodicity, and regression components in a more or less systematic way. However, we need to pick and choose these components to get a suitable forecast function. This approach requires an intimate familiarity with the data-generating process being modeled.
This approach is Bayesian in that we draw our parameters from a multivariate normal prior and use updating to improve this initial estimate by incorporating the data, ending up with a posterior, i.e. a distributional view of the time series that incorporates uncertainty. Additionally, we have a number of Bayesian quantities that can be derived from the model, such as
- the filtering distribution that estimates the current state \mathbb{P}r(\theta_t \mid \mathcal{D}_t),
- the forecasting distribution - to predict future observation: \mathbb{P}r(y_{t+h} \mid \mathcal{D}_t),
- the smoothing distribution - retrospective estimate of past state: \mathbb{P}r(\theta_t \mid \mathcal{D}_{T})\quad t<T and
- the forecast function, when F_t=F and \mathbf{G}_t=\mathbf{G}: f_t(h)=\mathbb{E}[y_{t+h} \mid \mathcal{D}_{t}] = F'G^h \mathbb{E}[\theta_{t} \mid \mathcal{D}_{t}], and
- the usual credible intervals for forecasts and parameter estimates.
However, the DLM framework is quite flexible, and once you understand it, it can be adapted to support features like seasonality using the superposition principle. NDLMs don't need to model non-stationary time series either.
As far as I can tell, NDLMs are just DLMs with their errors distributed normally at the different levels.