95  Bayesian Inference in the AR(p) - M2L5

Time Series Analysis

Bayesian inference in the AR(p): conditional likelihood, reference prior, model order selection, and the spectral density
Coursera
notes
Bayesian Statistics
Autoregressive Models
Time Series
Author

Oren Bochman

Published

November 5, 2024

Keywords

time series, AR(p) process, inference, order of an AR process, characteristic lag polynomial, autocorrelation function, ACF, partial autocorrelation function, PACF, smoothing, State Space Model, ARMA process, ARIMA, moving average, R code

95.1 Bayesian inference in the AR(p): Reference prior, conditional likelihood 🎥

Figure 95.1: inference for AR(p)

95.1.1 Model Setup

We start with the AR(p) model defined in the previous lesson:

y_t = \phi_1 y_{t-1} + \ldots + \phi_p y_{t-p} + \varepsilon_t \qquad \varepsilon_t \stackrel{iid}{\sim} \mathcal{N}(0, v) \tag{95.1}

  • We wish to infer:
    • \phi_1, \ldots, \phi_p, the AR coefficients, and
    • v, the innovation variance.

95.1.2 Conditional likelihood

We make use of the autoregressive structure and rewrite y_t conditionally on the previous p values of the process and the parameters:

(y_t \mid y_{t-1}, \ldots, y_{t-p}, \phi_1, \ldots, \phi_p) \sim \mathcal{N}\left (\sum_{j=1}^{p} \phi_j y_{t-j}, v\right ) \tag{95.2}

We condition on the first p values of the process. The conditional distribution of y_t given the previous p values and parameters is normal with mean given by the weighted sum \sum \phi_j y_{t-j} and variance v.

Conditioning on the first p values of the process, the joint density of the remaining observations y_{(p+1):T} is given by: \mathbb{P}r(y_{(p+1):T}\mid y_{1:p}, \phi_1, \ldots, \phi_p, v) = \prod_{t=p+1}^{T} \mathbb{P}r(y_t \mid y_{t-1}, \ldots, y_{t-p}, \phi_1, \ldots, \phi_p, v) \tag{95.3}

This product of conditionals is the conditional likelihood: each term is a Gaussian density, and the terms are conditionally independent given the past values and the parameters.
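
Since each factor is a normal density, the conditional log-likelihood is easy to evaluate directly. Below is a minimal sketch (a helper of my own, not part of the course code) that assumes yt is a numeric vector holding the observed series:

# conditional log-likelihood of an AR(p), conditioning on the first p values;
# phi is the coefficient vector, v the innovation variance
ar_cond_loglik <- function(yt, phi, v) {
  p <- length(phi)
  T <- length(yt)
  # conditional means: sum_j phi_j * y_{t-j}, for t = p+1, ..., T
  m <- sapply((p + 1):T, function(t) sum(phi * yt[(t - 1):(t - p)]))
  sum(dnorm(yt[(p + 1):T], mean = m, sd = sqrt(v), log = TRUE))
}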

95.1.3 Regression Formulation

This is recast as a linear regression: response vector \mathbf{y} starting from y_{p+1} to y_T, design matrix \mathbb{X} built from lagged values, and \boldsymbol\beta as the AR coefficients \phi_j.

\mathbf{y}= \mathbb{X}\boldsymbol \beta + \boldsymbol \varepsilon \qquad \varepsilon \sim \mathcal{N}(0, vI) \tag{95.4}

\mathbf{y} = \begin{pmatrix} y_{p+1} \\ y_{p+2} \\ \vdots \\ y_T \end{pmatrix}, \quad \boldsymbol \beta = \begin{pmatrix} \phi_1 \\ \phi_2 \\ \vdots \\ \phi_p \end{pmatrix}, \quad \mathbb{X} = \begin{pmatrix} y_{p} & y_{p-1} & \cdots & y_{1} \\ y_{p+1} & y_{p} & \cdots & y_{2} \\ \vdots & \vdots & \ddots & \vdots \\ y_{T-1} & y_{T-2} & \cdots & y_{T-p} \end{pmatrix}

In the design matrix \mathbb{X} each row corresponds to lagged observations used to predict the next value. This setup enables applying linear regression machinery.

Assuming \mathbb{X} has full column rank, the maximum likelihood estimator (MLE) of the AR coefficients \boldsymbol \beta is obtained by applying the left (Moore-Penrose) generalized inverse of \mathbb{X} to \mathbf{y}:

\boldsymbol \beta_{MLE} = (\mathbb{X}^\top \mathbb{X})^{-1} \mathbb{X}^\top \mathbf{y} \tag{95.5}

where:

  • \mathbb{X} is the design matrix,
  • \boldsymbol \beta is the vector of AR coefficients, and
  • \mathbf{y} is the response vector.

It matches the usual OLS solution in linear regression.
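
As a cross-check (a sketch of my own, not the course code), the same design matrix can be built with embed() and the MLE recovered with lm(); both routes should agree with Equation 95.5. This assumes yt holds the simulated series from the R sections below:

p <- 2                          # AR order used in the examples below
Z <- embed(yt, p + 1)           # each row is (y_t, y_{t-1}, ..., y_{t-p})
y_reg <- Z[, 1]                 # response vector y_{(p+1):T}
X_reg <- Z[, -1, drop = FALSE]  # design matrix of lagged values
phi_hat <- solve(t(X_reg) %*% X_reg, t(X_reg) %*% y_reg)  # closed-form MLE (Equation 95.5)
coef(lm(y_reg ~ X_reg - 1))     # OLS fit without intercept; should match phi_hat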

95.1.4 Reference prior and posterior distribution

The reference prior reflects non-informative beliefs; combined with the Gaussian likelihood, it yields a posterior available in closed form: normal for \boldsymbol\beta given v and inverse gamma for v.

p(\beta,v) \propto 1/v \tag{95.6}

(\boldsymbol \beta \mid \mathbf{y}_{1:T}, v) \sim \mathcal{N}(\boldsymbol \beta_{MLE}, v (\mathbb{X}^\top \mathbb{X})^{-1}) \tag{95.7}

The posterior is Gaussian with mean \boldsymbol \beta_{MLE} and scaled covariance v(\mathbb{X}^\top \mathbb{X})^{-1}, analogous to the OLS variance.

(v \mid \mathbf{y}_{1:T}) \sim \text{Inverse-Gamma}\left(\frac{T - 2p}{2}, \frac{S^2}{2}\right) \tag{95.8}

where:

  • S^2 = \sum_{t=p+1}^{T} \left(y_t - \sum_{j=1}^{p} \hat\phi_j\, y_{t-j}\right)^2 is the residual sum of squares, with \hat\phi_j the MLE coefficients; the unbiased estimator of v is S^2/(T - 2p).
  • T is the total number of observations.

The posterior for v is inverse gamma. The shape parameter is (T - 2p)/2 because the residual degrees of freedom are T - 2p (T - p observations in the conditional likelihood minus p estimated coefficients). The scale parameter is half the residual sum of squares, S^2/2.

95.1.5 Simulation-based Inference

Posterior sampling:

  • Draw v from the inverse gamma posterior.
  • Plug v into the posterior of \boldsymbol\beta and sample from the Gaussian.

This gives full Bayesian posterior samples of (\boldsymbol\beta, v).

95.1.6 Summary

  • In this lesson, we discussed Bayesian inference for the AR(p) process using the conditional likelihood and the reference prior.
  • We have established a connection between the AR(p) process and a regression model, allowing us to use standard regression techniques to estimate the parameters.
  • The posterior distribution can be derived, and we can perform simulation-based inference to obtain samples from the posterior distribution of the AR coefficients and variance.

We can now think about Bayesian inference in the general case of an AR(p) process. In the Bayesian case we have two choices to make: first, what kind of likelihood we are going to use, and then what kind of prior distribution we are going to use for the model parameters. I am going to discuss the case in which we use the conditional likelihood, where this time we condition on the first p values of the process, together with the reference prior for the model parameters. As in the AR(1) case, there is a correspondence between the conditional likelihood setting and a regression model. Recall that we can write the process as y_t = \phi_1 y_{t-1} + \ldots + \phi_p y_{t-p} + \varepsilon_t, with the assumption that the \varepsilon_t are iid normally distributed random variables with zero mean and variance v. The parameters of the process are all the \phi's and the variance v, and we want to make inference on those parameters. Given the conditional dependency structure and the past values of the series, the distribution of y_t given y_{t-1}, \ldots, y_{t-p}, the \phi's, and v is normal, with mean equal to the sum of those past values weighted by the \phi's, and variance v. Using this conditionally independent structure, I can write a likelihood function as the product of these terms: the density of y_{p+1}, \ldots, y_T (assuming we have observed T data points), conditional on the first p observations y_1, \ldots, y_p, the \phi's, and v, is the product of all these terms, each of which has a normal density given by this expression. As we said before, this conditional likelihood corresponds to a regression model of the form y = X\beta + \varepsilon, where y is a vector, X a matrix, \beta a vector, and \varepsilon a multivariate normal vector with covariance v times the identity. I can make the connection with this model by setting y, since I am conditioning on the first p values, to start at y_{p+1}, then y_{p+2}, and so on all the way to y_T. The vector \beta collects the coefficients of the linear component, which here are the \phi's, the AR parameters we want to estimate. Then I have the design matrix X for the linear regression in the AR(p) case: the first row corresponds to y_{p+1}, and since we are regressing on the past p values it goes from y_p all the way to y_1; the row for y_{p+2} is similar, and we continue all the way down to y_T, whose row goes from y_{T-1}, y_{T-2}, down to y_{T-p}. That is the X matrix. As you know, I can find the maximum likelihood estimator for this process, which I will simply call \hat{\beta}: assuming the design matrix is full rank, it is (X^\top X)^{-1} X^\top y. We can simply plug in this X matrix and this y vector to obtain the maximum likelihood estimator of the AR coefficients.
Using these results, we can also write down the posterior distribution under the reference prior. The conditional likelihood gives the likelihood, and the reference prior is of the form p(\beta, v) \propto 1/v. In this case, Bayesian inference can be done in closed form: the density of \beta given all the y's and v is normal, with mean \hat{\beta}, the maximum likelihood estimator, and covariance v (X^\top X)^{-1}. The marginal distribution of v given all the observations is an inverse gamma, and its parameters are related to the dimension of the response vector y: in the AR(p) case y has dimension T - p, and we also have to subtract the dimension of the \beta vector, which is p, giving T - p - p, which is where the 2p in the shape parameter comes from; the other component is S^2/2. Once again, if I want to do simulation-based posterior inference, I can get a sample of v from this inverse gamma distribution, plug that sample into the conditional normal, and obtain a sample of the AR coefficients. In this way I can get full posterior inference.

95.2 R code: Maximum likelihood estimation, AR(p), conditional likelihood 🗒️

set.seed(2021)
# Simulate 300 observations from an AR(2) with one pair of complex-valued reciprocal roots 
r=0.95
lambda=12 
phi=numeric(2) 
phi[1]=2*r*cos(2*pi/lambda) 
phi[2]=-r^2
sd=1 # innovation standard deviation
T=300 # number of time points
# generate stationary AR(2) process
yt=arima.sim(n = T, model = list(ar = phi), sd = sd) 

## Compute the MLE for phi and the unbiased estimator for v using the conditional likelihood
p=2
y=rev(yt[(p+1):T]) # response
X=t(matrix(yt[rev(rep((1:p),T-p)+rep((0:(T-p-1)),rep(p,T-p)))],p,T-p)); # design matrix: each row holds (y_{t-1}, ..., y_{t-p}) for the matching entry of y
XtX=t(X)%*%X
XtX_inv=solve(XtX)
phi_MLE=XtX_inv%*%t(X)%*%y # MLE for phi
s2=sum((y - X%*%phi_MLE)^2)/(length(y) - p) #unbiased estimate for v

cat("\n MLE of conditional likelihood for phi: ", phi_MLE, "\n",
    "Estimate for v: ", s2, "\n")

 MLE of conditional likelihood for phi:  1.65272 -0.9189823 
 Estimate for v:  0.9901292 

95.3 Model order selection 🎥

Figure 95.2: model order selection

95.3.1 Goal

Determine the best model order p for an AR(p) process by comparing candidate models.

95.3.2 Step 1: Define a candidate set of model orders

p \in \{1, \ldots, p^*\}

You choose a maximum order p^* (e.g., 20), then evaluate AR models of each order p \in \{1, \dots, p^*\}.

95.3.3 Step 2: Estimate residual variance for each model

S^2_p \qquad \text{computed from } y_{(p^*+1):T}

For each model of order p, compute S_p^2 as the residual sum of squares (RSS) using a fixed evaluation window, t = p^*+1, \ldots, T, to ensure all models are evaluated on the same data subset.

95.3.4 Step 3: Compute model selection criteria

AIC balances fit (log-likelihood) with complexity (number of parameters). The first term measures model fit using \log(S_p^2), and the second term penalizes complexity with 2p.

AIC_p = (T-p^*) \log(S_p^2) + 2p

BIC uses the same fit term but a heavier penalty: p \log(T - p^*), which increases with sample size. This often leads BIC to favor smaller models than AIC.

BIC_p = (T-p^*) \log(S_p^2) + p \log(T-p^*)

One of the big questions when you are working with autoregressive processes is what model order you should use for your particular data. One thing you can do is look at the sample ACF and PACF of the data you have and begin with a model order that reflects the behavior you see in those sample ACFs and PACFs. But you can also take a more formal approach and use model order selection criteria to pick the model order. You could even be more sophisticated and treat the model order as uncertain, consider it a random variable, put a prior on it, and then perform Bayesian inference. I am going to discuss here how we can use two model selection criteria, the AIC (Akaike's Information Criterion) and the BIC, to choose the model order. These are usually implemented in software packages in R, and you can use them in practice for your given data. You can work with the conditional likelihood or with the full likelihood; if you are working with the conditional likelihood, you have to evaluate these model selection criteria using the same data for every candidate model. Say you have a candidate set of models: you pick a maximum model order, which I am going to call p-star; for example, p-star could be 20. Then you consider all the possible orders from one (or from zero, if you want to include the white noise model) all the way up to p-star. For each of those model orders, if you are working with the conditional likelihood, you have maximum likelihood estimates of the model parameters, your phi hats, and you have S_p squared, the S squared you get from the model of order p; the subscript p just indicates that it is related to that model order. You then evaluate everything using the data from p-star plus 1 all the way to T. It is important that you use only this data when computing your regressions, so that every model is evaluated on the same set of data points. The Akaike Information Criterion in this case is a function with two components: one has to do with how good the fit of the model is, and the other penalizes the number of parameters in the model. Usually in these model selection criteria we have those two components. You can write it as the number of observations you are evaluating on, which, since we condition on the first p-star observations, is the data from p-star plus 1 up to T, times the log of S_p squared; that is the part of the AIC that has to do with goodness of fit. Then there is a penalty for the number of parameters: AIC uses a penalty of two times p. What you do is fit all these model orders, from one (or zero) to p-star, compute an AIC for each, and compare them; there is a value for each p, and the optimal order is the one that minimizes the AIC.
You look at those values and pick the model order that minimizes the AIC. You can also use the BIC. The BIC has the same goodness-of-fit component, but it penalizes the number of parameters in the model in a different way: the penalty is p times the log of T minus p-star. In the AIC the penalty was two times the number of parameters; here it is the number of parameters times something that depends on the number of observations you have. Because it is a different penalty, you may get different results. Again, you evaluate the BIC for each of the model orders you are considering and pick the model order that minimizes this expression. As you can see, there is usually a balance: the more parameters you consider, the better your model fit will be, but you may be overfitting with too many parameters in your model, so these model selection criteria try to balance those components and give a penalty for the number of parameters. When you run things in practice, you may get the same optimal model order from AIC and BIC, or you may get different numbers. You can also look at other quantities, such as posterior predictive distributions, and see how well the model does in terms of those posterior predictive densities. You can, as I said before, consider p as another random variable in your model. We will stay simple here and just use these model selection criteria.

95.4 Example: Bayesian inference in the AR(p), conditional likelihood 🎥

This video walks through the code in the next section which demonstrates Bayesian inference for an AR(2) process with complex-valued roots.

We now talk about how to obtain maximum likelihood estimates and Bayesian inference using the conditional likelihood, in this case for an AR(2) process. Here I am first sampling from a specific AR(2) process so that we have data to work with. I am using an AR(2) with one pair of complex-valued reciprocal roots; the modulus of the reciprocal roots is 0.95 and the period is 12, so we will observe this quasi-periodicity in the sampled time series. There are 300 observations, the standard deviation of the innovations is one, and we are using the arima.sim function as before. As you recall, in the case of the conditional likelihood there is a correspondence between the conditional likelihood and the equations of the linear regression model. Here I am just applying the equations we discussed before and writing everything in terms of a regression model in vector form: we have a response y and a design matrix X. Since I am working with the conditional likelihood, I condition on the first p observations, where p in this particular case is two. I build my X matrix, and the maximum likelihood estimator for phi is obtained using the equation (X transpose X) inverse times X transpose y. This is the MLE of the AR coefficients, and then I have the unbiased estimator for v, which we call s2 here, again using the equations we discussed before. If you compare the true values of the AR coefficients with the estimates: the true value of the coefficient at the first lag, phi_1, is about 1.65, which is also what we obtain in the maximum likelihood estimate using the conditional likelihood, and the second coefficient is around negative 0.9, again close to the maximum likelihood estimate. s squared is an estimate of v, and here the true variance is one; again, this estimator is based on the data we simulated. Now, for the AR(2), we can obtain posterior inference using the conditional likelihood and the reference prior, and in that case we can do everything using direct sampling. I am going to use the library MASS here just for its functions to sample random variables. If you recall, the form of the posterior distribution is such that the marginal for the variance given the data is an inverse gamma distribution, so we can sample from this inverse gamma using the function rgamma and taking the inverse, with the parameters given by the equations we discussed before. In step 2, conditional on v, we can sample phi from a normal distribution, in this case a multivariate (here bivariate) normal since we have an AR(2), and I am simply obtaining samples from that distribution. I set the number of samples to 1,000; you can choose as many or as few posterior samples as you want. Based on these, we can look at posterior estimates through graphical representations, such as histograms of the samples of phi and the variance. If we zoom in, there are three histograms: the first corresponds to the samples of the first AR coefficient phi_1, the red line corresponds to the true value, and the histogram represents the posterior distribution of phi_1.
Similarly, the second histogram gives me a representation of the posterior distribution of phi_2 based on those 1,000 samples from the posterior distribution. Then I also have the samples of v, along with the true value, which is one. We can see that we are recovering those values, and we can also look at graphical representations of functions of those parameters. For example, we can summarize the posterior distribution of the modulus and the period of the reciprocal roots of the characteristic polynomial. The good thing with simulation-based posterior sampling is that if we have samples of the AR coefficients and the variance, we can simply apply any transformation, any function of those, and obtain posterior summaries of the transformed values. In this case the modulus is obtained from a transformation of the phi_2 coefficient, so if we have samples of phi_2 we can transform them; the period is a function of the modulus and the first coefficient, so we can transform those as well and summarize the resulting histograms. We can see that the true value of the modulus was 0.95, and this histogram shows the posterior samples of the modulus based on those 1,000 posterior samples; the other histogram is for the period, whose true value was 12 in this particular example. This also gives you uncertainty quantification for those parameters. Another thing we discussed is that we can fix the model order and then obtain posterior inference, but in practice we usually also have uncertainty about the model order, and we said we can use different procedures to address that problem. One of the methods we can use is simply to pick the model order using a model selection criterion. This code shows how you can use AIC and BIC to find the optimal model order and then, conditional on that optimal model order, proceed with your posterior inference. In this case, again for the simulated dataset, I am going to assume I do not know the model order and set the maximum model order to 10, so I will be searching for the optimal model order among AR models with orders up to 10, and we will see what happens. For each of those model orders I keep the X matrix and the y vector needed for maximum likelihood inference. I created a function to compute the MLE; it returns all the quantities we need (the degrees of freedom, the residuals, the MLE, and so on) for a given model order, with the X and y that depend on the data you have. There is also an AIC_BIC function that simply calls this MLE function and computes the AIC and the BIC with their different penalties: if you look inside this function, you will see that the AIC penalizes the number of parameters in the model with 2 times p, while the BIC uses log(n) times p.
Based on this information, we can plot the AIC and the BIC for different values of the model order; here we are plotting, for each criterion, the difference between its value and its minimum value, and we want the model order that minimizes this quantity. We see that for both the AIC and the BIC the optimal model order happens to be two. Once we have that, we can choose the model order that minimizes one of the two criteria, and in this case I am choosing BIC. For this particular example both model selection criteria give the same answer for the model order; this does not have to be the case for every dataset, so AIC and BIC may point to different optimal model orders. In this example they give the same answer. Then you can proceed with the maximum likelihood estimation and, under the reference prior, the posterior means of the AR coefficients and of the standard deviation; based on those posterior means you can obtain estimates of the modulus and the period of the reciprocal roots. Here I am using that optimal model order and looking at estimates based on the posterior distribution. Again, you can pick one of the two model selection criteria to obtain your optimal model order and then do your inference conditional on that.

95.5 code: Bayesian inference, AR(p), conditional likelihood 🗒️ \mathcal{R}

95.5.1 Simulate 300 observations from an AR(2) with one pair of complex-valued roots

set.seed(2021)
r=0.95
lambda=12 
phi=numeric(2) 
phi[1]=2*r*cos(2*pi/lambda) 
phi[2]=-r^2
sd=1 # innovation standard deviation
T=300 # number of time points
# generate stationary AR(2) process
yt=arima.sim(n = T, model = list(ar = phi), sd = sd) 

par(mfrow=c(1,1), mar = c(3, 4, 2, 1) )
plot(yt)
Figure 95.3: Simulated AR(2) process with complex-valued roots

95.5.2 Compute the MLE of phi and the unbiased estimator of v using the conditional likelihood

p=2
y=rev(yt[(p+1):T]) # response
X=t(matrix(yt[rev(rep((1:p),T-p)+rep((0:(T-p-1)),rep(p,T-p)))],p,T-p));
XtX=t(X)%*%X
XtX_inv=solve(XtX)
phi_MLE=XtX_inv%*%t(X)%*%y # MLE for phi
phi_MLE
           [,1]
[1,]  1.6527203
[2,] -0.9189823
s2=sum((y - X%*%phi_MLE)^2)/(length(y) - p) #unbiased estimate for v
s2
[1] 0.9901292

95.5.3 Posterior inference, conditional likelihood + reference prior via direct sampling

n_sample=1000 # posterior sample size
library(MASS)

## step 1: sample v from inverse gamma distribution
v_sample=1/rgamma(n_sample, (T-2*p)/2, sum((y-X%*%phi_MLE)^2)/2)

## step 2: sample phi conditional on v from normal distribution
phi_sample=matrix(0, nrow = n_sample, ncol = p)
for(i in 1:n_sample){
  phi_sample[i, ]=mvrnorm(1,phi_MLE,Sigma=v_sample[i]*XtX_inv)
}

par(mfrow = c(2, 3), mar = c(3, 4, 2, 1),  cex.lab = 1.3)
## plot histogram of posterior samples of phi and v

for(i in 1:2){
  hist(phi_sample[, i], xlab = bquote(phi), 
       main = bquote("Histogram of "~phi[.(i)]),col='lightblue')
  abline(v = phi[i], col = 'red')
}

hist(v_sample, xlab = bquote(nu), main = bquote("Histogram of "~v),col='lightblue')
abline(v = sd^2, col = 'red') # true innovation variance v = sd^2 = 1
Figure 95.4: Posterior distributions of AR(2) parameters

95.5.4 Graph posterior for modulus and period

r_sample=sqrt(-phi_sample[,2])
lambda_sample=2*pi/acos(phi_sample[,1]/(2*r_sample))

hist(r_sample,xlab="modulus",main="",col='lightblue')
abline(v=0.95,col='red')

hist(lambda_sample,xlab="period",main="",col='lightblue')
abline(v=12,col='red')
(a) modulus
(b) period
Figure 95.5: Posterior distributions of AR(2) roots

95.6 code: Model order selection 🗒️ \mathcal{R}

95.6.1 Simulate data from an AR(2)

set.seed(2021)
r=0.95
lambda=12 
phi=numeric(2) 
phi[1]=2*r*cos(2*pi/lambda) 
phi[2]=-r^2
sd=1 # innovation standard deviation
T=300 # number of time points
# generate stationary AR(2) process
yt=arima.sim(n = T, model = list(ar = phi), sd = sd) 
par(mfrow=c(1,1), mar = c(3, 4, 2, 1) )
plot(yt)
Figure 95.6: Simulated AR(2) process with complex-valued roots

95.6.2 compute MLE for different AR(p)s

pmax=10 # the maximum of model order
Xall=t(matrix(yt[rev(rep((1:pmax),T-pmax)+rep((0:(T-pmax-1)),
              rep(pmax,T-pmax)))], pmax, T-pmax));
y=rev(yt[(pmax+1):T])
n_cond=length(y) # (number of total time points - the maximum of model order)

## compute MLE
my_MLE <- function(y, Xall, p){
  n=length(y)
  x=Xall[,1:p]
  a=solve(t(x) %*%x)
  a=(a + t(a))/2 # for numerical stability 
  b=a%*%t(x)%*%y # mle for ar coefficients
  r=y - x%*%b # residuals 
  nu=n - p # degrees freedom
  R=sum(r*r) # SSE
  s=R/nu #MSE
  return(list(b = b, s = s, R = R, nu = nu))
}

95.6.3 Compute AIC and BIC for different AR(p)s based on simulated data

## function for AIC and BIC computation 
AIC_BIC <- function(y, Xall, p){
  ## number of time points
  n <- length(y)
  
  ## compute MLE
  tmp=my_MLE(y, Xall, p)
  
  ## retrieve results
  R=tmp$R
  
  ## goodness-of-fit term: n*log(RSS), equal to -2 times the maximized Gaussian
  ## log-likelihood up to an additive constant that does not depend on p
  likl= n*log(R)
  
  ## compute AIC and BIC
  aic =likl + 2*(p)
  bic =likl + log(n)*(p)
  return(list(aic = aic, bic = bic))
}
# Compute AIC, BIC 
aic =numeric(pmax)
bic =numeric(pmax)

for(p in 1:pmax){
  tmp =AIC_BIC(y,Xall, p)
  aic[p] =tmp$aic
  bic[p] =tmp$bic
  print(c(p, aic[p], bic[p])) # print AIC and BIC by model order
}
[1]    1.000 2166.793 2170.463
[1]    2.000 1635.816 1643.156
[1]    3.000 1637.527 1648.536
[1]    4.000 1639.059 1653.738
[1]    5.000 1640.743 1659.093
[1]    6.000 1641.472 1663.491
[1]    7.000 1643.457 1669.147
[1]    8.000 1645.370 1674.729
[1]    9.000 1646.261 1679.290
[1]   10.000 1647.915 1684.614
## compute difference between the value and its minimum
aic =aic-min(aic) 
aic
 [1] 530.977092   0.000000   1.710562   3.242352   4.927067   5.655454
 [7]   7.641270   9.553690  10.444845  12.099207
bic =bic-min(bic) 
bic
 [1] 527.307211   0.000000   5.380443  10.582114  15.936709  20.334978
 [7]  25.990674  31.572975  36.134011  41.458254

95.6.4 Plot AIC and BIC

par(mfrow = c(1, 1), mar = c(3, 4, 2, 1) )
matplot(1:pmax,matrix(c(aic,bic),pmax,2),ylab='value',
        xlab='AR order p',pch="ab", col = 'black', main = "AIC and BIC for Model Selection")
# highlight the model order selected by AIC
text(which.min(aic), aic[which.min(aic)], "a", col = 'red') 
# highlight the model order selected by BIC
text(which.min(bic), bic[which.min(bic)], "b", col = 'red') 

legend("topright", legend = c("AIC", "BIC"), col = c("black", "black"), pch = c("a", "b"))

p <- which.min(bic) # choose the model order selected by BIC
print(paste0("The chosen model order by BIC: ", p))
[1] "The chosen model order by BIC: 2"
Figure 95.7: AIC, BIC for different AR(p) model orders

95.7 Spectral representation of the AR(p) 🎥

Figure 95.8: Spectral representation of the AR(p)

The spectral representation of the AR(p) process is a powerful tool for understanding the frequency domain properties of the process. It allows us to analyze how the process behaves at different frequencies and provides insights into its periodicity and persistence.

Suppose we have an AR(p) process with parameters:

\phi_1, \ldots, \phi_p, v

As we have seen, properties of the process such as the autocorrelation function and the forecast function are related to the reciprocal roots of the characteristic polynomial.

Using this information, we can also obtain a spectral representation of the process, that is, a representation in the frequency domain. We can think of this as a density in the frequency domain, with a power associated with each frequency. For an AR(p) process, the spectral density is:

f(\omega) = \frac{v}{2\pi \mid 1 - \phi_1 e^{-i\omega} - \ldots - \phi_p e^{-i \omega p} \mid^2}

with:

  • \omega \in [0,\pi] the frequency

For instance, suppose you have an AR(2), an autoregressive process of order 2, with a pair of complex-valued reciprocal roots \alpha_1, \alpha_2 of modulus r = 0.7 and period \lambda = 12. The spectral density of this AR(2) then shows a peak at the frequency 2\pi/\lambda = 2\pi/12, and the height of that peak is determined by the modulus r.
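
To make this concrete, here is a small sketch (my own code, not from the lecture; the innovation variance v = 1 is an arbitrary choice) that evaluates the AR(2) spectral density directly from the formula above for r = 0.7 and \lambda = 12:

r <- 0.7; lambda <- 12; v <- 1                 # modulus, period, assumed innovation variance
phi <- c(2 * r * cos(2 * pi / lambda), -r^2)   # implied AR(2) coefficients
omega <- seq(0.001, pi, length.out = 500)      # frequencies in (0, pi)
# f(omega) = v / (2 pi |1 - phi_1 e^{-i omega} - phi_2 e^{-2 i omega}|^2)
f <- v / (2 * pi * Mod(1 - phi[1] * exp(-1i * omega) - phi[2] * exp(-2i * omega))^2)
plot(omega, log(f), type = "l", xlab = "frequency", ylab = "log spectral density")
abline(v = 2 * pi / lambda, lty = 2)           # the peak sits near the frequency 2*pi/12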

It is worth talking a little about the relationship between the AR(p) in the time domain and the spectral representation we can obtain in the frequency domain. This is important for interpretability, and it provides another way to see what kind of properties the process has. In the case of an AR(p), you have the AR coefficients and the variance v that we have talked about. As we have seen, properties related to the autocorrelation function of the process and the forecast function are related to the reciprocal roots of the characteristic polynomial.

Using this information, we can also obtain a spectral representation of the process, meaning a representation in the frequency domain.

Think of this as a density in the frequency domain in which a power is associated with each frequency. In the case of an AR(p), I can write down the spectral density of the process, calling it f(\omega), where \omega is a frequency between 0 and \pi. I write it as v, the variance of the process, divided by the squared modulus of 1 - \phi_1 e^{-i\omega} - \ldots - \phi_p e^{-i \omega p} and a factor of 2\pi. If I take a frequency between zero and \pi, I can plug it in here, and you can see that this is essentially the AR characteristic polynomial evaluated at e^{-i\omega}. Once again, this frequency domain representation is related to the characteristic roots of the AR polynomial.

For instance, suppose you have an AR(2), an autoregressive process of order 2, with a pair of complex-valued reciprocal roots \alpha_1 and \alpha_2, and suppose the modulus is 0.7.

Then there is a frequency, which I am going to express in terms of the period: let's say the quasi-periodicity in the AR is 12.

Again, I can go back and forth between the complex-valued representation and the polar coordinates.

So I have that pair of roots with modulus 0.7 and period 12, and I plug that representation into the spectral density.

The pair \alpha_1, \alpha_2 corresponds to particular values of \phi_1 and \phi_2; I plug those in, use the variance, and plot the result as a function of \omega.

What happens there is you are going to have something that looks like this.

The peak of that density is going to be at the frequency that corresponds to this period of 12, and the height of the peak, the power in that spectral density, is related to the modulus of that pair of complex roots. If you now have a different process with the same quasi-periodicity but a higher modulus, that is going to result in a higher peak, a higher power, in your spectral density. If you have another process with a larger or smaller period, that is going to shift the mode of that density.

This is a way to represent the process in the spectral domain.

95.8 Spectral representation of the AR(p): Example 🎥

This video walks through the code in the next section.

95.8.1 Summary: Posterior Spectral Density Estimation for AR(2)

  • Goal: Estimate and visualize the spectral density of an AR(2) process and its posterior distribution based on observed data.

  • Simulation: 300 observations are drawn from an AR(2) process with complex conjugate roots of modulus 0.95 and period 12, leading to a quasi-periodic process.

  • Estimation:

    • AR coefficients estimated via MLE.

    • Innovation variance v estimated as s^2 (unbiased).

    • Posterior samples:

      • v \sim \text{Inverse-Gamma}
      • \boldsymbol\Phi \mid v \sim \text{Bivariate Normal}
  • Spectral density estimation:

    • spec.ar(): Uses MLEs to estimate the spectral density.

    • arma.spec() from astsa: Allows using arbitrary (MLE or posterior) estimates of AR coefficients and variance.

      • Accepts posterior samples, producing a distribution of spectral density curves.
  • Results:

    • Plots show a spectral peak at period 12, as expected from the simulation.
    • Posterior spectral densities are overlaid.
    • Solid line = MLE-based spectrum; dotted line = true spectrum peak.

This process illustrates Bayesian uncertainty in frequency-domain features of AR processes.

We will now see how to obtain the spectral density of an autoregressive process, and also how to look at posterior distributions of the spectral density of this AR process based on the data we have.

The first thing I’m going to do here is again, I’m going to simulate some data from an autoregressive process of order 2. This process has one pair of complex valued reciprocal roots with modulus 0.95 and period 12. In this case, I’m just going to do what we’ve done before, which is just simulate 300 observations from this AR(2) process.

Based on this model order of two, we can obtain the maximum likelihood estimators for the AR coefficients and also the unbiased estimator for v, which we call s^2. Based on these, we can obtain 200 samples from the posterior distribution using direct simulation. We’ve done this before.

Again, I’m going to be sampling v from the inverse Gamma distribution, and then conditional on v, I can obtain samples for the \Phi using the bivariate normal.

Based on these estimates or these samples, I can just look at what is the spectral density of this process.

I am going to use the spec.ar function: you pass the data, specify the model order, and it uses the maximum likelihood estimates of the AR coefficients and the variance, then performs the transformation to obtain an estimate of the spectral density. So you can use spec.ar and R automatically produces this estimate of the spectral density. As you can see, the process is quasi-periodic and there is a peak at the periodicity; in this case the periodicity is 12 because that is how we simulated the process.

You can also use another function, arma.spec from the astsa package, to draw the spectral density, and I am going to illustrate that here. If you want to use it, you load the library. The good thing with this function is that you can pass your estimates, whatever they are, for the AR coefficients and for the variance. Here I am passing the maximum likelihood estimator based on the conditional likelihood for the AR coefficients, and for the variance of the error terms I am passing s squared, but you could pass any other estimator, for example a posterior sample.

Then we obtain this again. This is a figure that is very similar to the one we had before, just plotting the spectral density. Again, we see that the power spectrum has this peak at the quasi periodic component with periodicity 12.

As I said, the nice thing about this function is that it allows you to pass any estimates or values for the phis and the variance. In this case I am passing samples from the posterior distribution: for each posterior sample I compute arma.spec, obtaining an estimate of the spectral density. This takes a little time to compute because we have a number of samples and it has to do the calculations for each of them. Based on the results, you can then plot the corresponding spectral density for each of those samples.

For each of those samples of the AR coefficients and the variance, you get one sample of the spectral density.

I am plotting these here as a distribution. The solid line corresponds to the maximum likelihood estimator, and the dotted line corresponds to the true value, in the sense of where the peak of the log spectral density should be; it corresponds to a period of 12, which is consistent with what we obtain in terms of the posterior distribution.

95.9 code: Spectral density of AR(p) 🗒️ \mathcal{R}

We will now obtain the spectral density of an autoregressive process, as well as visualize posterior distributions of the spectral density of this AR process based on the observed data.

Simulate 300 observations from an AR(2) process with a pair of complex-valued roots

set.seed(2021)
r=0.95
lambda=12 
phi=numeric(2) 
phi[1]<- 2*r*cos(2*pi/lambda) 
phi[2] <- -r^2
sd=1 # innovation standard deviation
T=300 # number of time points
# sample from the AR(2) process
yt=arima.sim(n = T, model = list(ar = phi), sd = sd) 

Compute the MLE of \phi and the unbiased estimator of v using the conditional likelihood

p=2
y=rev(yt[(p+1):T])
X=t(matrix(yt[rev(rep((1:p),T-p)+rep((0:(T-p-1)),rep(p,T-p)))],p,T-p));
XtX=t(X)%*%X
XtX_inv=solve(XtX)
phi_MLE=XtX_inv%*%t(X)%*%y # MLE for phi
s2=sum((y - X%*%phi_MLE)^2)/(length(y) - p) #unbiased estimate for v

Obtain 200 samples from the posterior distribution under the conditional likelihood and the reference prior and the direct sampling method

n_sample=200 # posterior sample size
library(MASS)

## step 1: sample v from inverse gamma distribution
v_sample=1/rgamma(n_sample, (T-2*p)/2, sum((y-X%*%phi_MLE)^2)/2)

## step 2: sample phi conditional on v from normal distribution
phi_sample=matrix(0, nrow = n_sample, ncol = p)
for(i in 1:n_sample){
  phi_sample[i,]=mvrnorm(1,phi_MLE,Sigma=v_sample[i]*XtX_inv)
}

Using spec.ar to draw the spectral density based on the data, assuming an AR(2)

spec.ar(yt, order = 2, main = "yt")

Using arma.spec from the astsa package to draw the spectral density

Plot the spectral density of the simulated data with posterior samples of the AR coefficients and the innovation variance

library("astsa")

par(mfrow = c(1, 1), mar = c(3, 4, 2, 1) )
#result_MLE=arma.spec(ar=phi_MLE, var.noise = s2, log='yes',main = '')
result_MLE=arma.spec(ar=phi_MLE, var.noise = s2, main = '')
freq=result_MLE$freq
  
spec=matrix(0,nrow=n_sample,ncol=length(freq))

for (i in 1:n_sample){
result=arma.spec(ar=phi_sample[i,], var.noise = v_sample[i],# log='yes',
                 main = '',plot=FALSE)
spec[i,]=result$spec
}
Figure 95.9: Spectral density of AR(2) process using arma.spec
plot(2*pi*freq,log(spec[1,]),type='l',ylim=c(-3,12),ylab="log spectra",
     xlab="frequency",col=0)
#for (i in 1:n_sample){
for (i in 1:2){
lines(2*pi*freq,log(spec[i,]),col='darkgray')
}
lines(2*pi*freq,log(result_MLE$spec))
abline(v=2*pi/12,lty=2,col='red')
Figure 95.10: Spectral density of AR(2) process using arma.spec

95.10 ARIMA processes 🗒️

This section introduces the ARMA and ARIMA processes, their definitions, stability, invertibility, and spectral density. It is based on a handout from the course. Ideally it should be expanded with more details and examples and code for simulating ARMA and ARIMA processes. I’m not sure why Prado didn’t go any deeper in the course, but the NDLMs generalize the ARMA and ARIMA processes, so perhaps we will cover them as special cases of the NDLMs in the next lessons.

Important: ARMA Model Definition

A time series process \{y_t\} is a zero-mean autoregressive moving average process, ARMA(p,q), if it is given by

y_t = \textcolor{red} {\underbrace{\sum_{i=1}^{p} \phi_i y_{t-i}}_{AR(p)}} + \textcolor{blue} {\underbrace{\sum_{j=1}^{q} \theta_j \varepsilon_{t-j}}_{MA(q)}} + \varepsilon_t \tag{95.9}

with \varepsilon_t \sim N(0, v).

  • For q = 0, we get an AR(p) process.
  • For p = 0, we get an MA(q), i.e., a moving average process of order q.

Next we will define the notions of stability and invertibility of an ARMA process.

Important: Stability Definition

An ARMA process is stable if the roots of the AR characteristic polynomial

\Phi(u) = 1 - \phi_1 u - \phi_2 u^2 - \ldots - \phi_p u^p

lie outside the unit circle, i.e., for all u such that \Phi(u) = 0, \|u\| > 1.

Equivalently, this happens when the reciprocal roots of the AR(p) polynomial have moduli smaller than 1.

This condition implies stationarity.
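
A quick numerical check of this condition (a sketch of my own, not from the handout): compute the roots of \Phi(u) with polyroot() and verify that all moduli exceed one. The example coefficients below correspond to the AR(2) with r = 0.95 and \lambda = 12 used elsewhere in these notes.

phi_ex <- c(2 * 0.95 * cos(2 * pi / 12), -0.95^2)  # example AR(2) coefficients
all(Mod(polyroot(c(1, -phi_ex))) > 1)              # TRUE => stable (stationary)
# the analogous invertibility check uses the MA coefficients: all(Mod(polyroot(c(1, theta))) > 1)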


Important: Invertible ARMA Definition

An ARMA process is invertible if the roots of the MA characteristic polynomial given by

\Theta(u) = 1 + \theta_1 u + \ldots + \theta_q u^q, \tag{95.10}

lie outside the unit circle.


Note that \Phi(B) y_t = \Theta(B) \varepsilon_t.

  • When an ARMA process is stable, it can be written as an infinite order moving average process (see the sketch after this list).
  • When an ARMA process is invertible, it can be written as an infinite order autoregressive process.
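
As an illustration of the first point above, the coefficients of the implied infinite-order moving average representation of a stable ARMA can be computed with ARMAtoMA() from base R (the parameter values here are arbitrary examples):

ARMAtoMA(ar = 0.6, ma = 0.3, lag.max = 10)  # first 10 psi-weights of the MA(infinity) form of an ARMA(1,1)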

Important: ARIMA Processes

An autoregressive integrated moving average process with orders p, d, and q, written ARIMA(p, d, q), is a process y_t whose d-th difference z_t = (1 - B)^d y_t follows an ARMA(p, q), i.e.,

z_t = \sum_{i=1}^{p} \phi_i z_{t-i} + \sum_{j=1}^{q} \theta_j \varepsilon_{t-j} + \varepsilon_t, \tag{95.11}

where B is the backshift operator, d is the order of integration, and \varepsilon_t \sim N(0, v).

In other words, y_t follows an ARIMA(p, d, q) if the d-th difference of y_t follows an ARMA(p, q).

Estimation in ARIMA processes can be done via least squares, maximum likelihood, and also in a Bayesian way. We don’t discuss Bayesian estimation of ARIMA processes in this course.
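
As a brief illustration (a sketch with arbitrary parameter values, not course code), an ARIMA(1,1,1) can be simulated with arima.sim() and fit by maximum likelihood with arima():

set.seed(1)
z <- arima.sim(n = 500, model = list(order = c(1, 1, 1), ar = 0.6, ma = 0.3))
fit <- arima(z, order = c(1, 1, 1))  # maximum likelihood fit of the ARIMA(1,1,1)
fit$coef                             # estimated ar1 and ma1 coefficients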

95.10.1 Spectral Density of ARMA Processes

For a given AR(p) process with AR coefficients \phi_1, \dots, \phi_p and variance v, we can obtain its spectral density as

f(\omega) = \frac{v}{2\pi \|\Phi(e^{-i\omega})\|^2} = \frac{v}{2\pi \|1 - \phi_1 e^{-i\omega} - \ldots - \phi_p e^{-ip\omega}\|^2}, \tag{95.12}

with \omega a frequency in (0, \pi).

The spectral density provides a frequency-domain representation of the process that is appealing because of its interpretability.

For instance, an AR(2) process that has one pair of complex-valued reciprocal roots with modulus 0.7 and a period of \lambda = 12, will show a mode in the spectral density located at a frequency of 2\pi/12. If we keep the period of the process at the same value of 12 but increase its modulus to 0.95, the spectral density will continue to show a mode at 2\pi/12, but the value of f(2\pi/12) will be higher, indicating a more persistent quasi-periodic behavior.

Similarly, we can obtain the spectral density of an ARMA process with AR characteristic polynomial \Phi(u) = 1 - \phi_1 u - \ldots - \phi_p u^p and MA characteristic polynomial \Theta(u) = 1 + \theta_1 u + \ldots + \theta_q u^q, and variance v as

f(\omega) = \frac{v}{2\pi} \frac{\|\Theta(e^{-i\omega})\|^2}{\|\Phi(e^{-i\omega}) \|^2}. \tag{95.13}

Note that if we have posterior estimates or posterior samples of the AR/ARMA coefficients and the variance v, we can obtain samples from the spectral density of AR/ARMA processes using the equations above.
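
A compact sketch of Equation 95.13 (a helper of my own, not from the handout); applying it to each posterior draw of the coefficients and variance gives posterior samples of the spectral density:

# spectral density of an ARMA(p, q) at frequencies omega in (0, pi)
arma_spec_dens <- function(omega, phi = numeric(0), theta = numeric(0), v = 1) {
  Phi   <- sapply(omega, function(w) 1 - sum(phi   * exp(-1i * w * seq_along(phi))))
  Theta <- sapply(omega, function(w) 1 + sum(theta * exp(-1i * w * seq_along(theta))))
  (v / (2 * pi)) * Mod(Theta)^2 / Mod(Phi)^2
}
# e.g., for the AR posterior draws obtained earlier: arma_spec_dens(omega, phi = phi_sample[i, ], v = v_sample[i])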