90  The AR(1): MLE and Bayesian inference

Time Series Analysis

In this lesson we define the AR(1) process and cover stationarity, the ACF, the PACF, differencing, and smoothing
Bayesian statistics
Author

Oren Bochman

Published

November 3, 2024

Keywords

notes, time series, autoregressive models, stationarity, MLE, AR(1) process, Yule-Walker equations, Durbin-Levinson recursion, Moore-Penrose pseudoinverse, Full Rank Matrix, R code

90.1 Maximum likelihood and Bayesian inference in regression 🗒️

This section is based on the handout provided in the course materials. The gist of this handout is that if we can write the regression as the equation of a straight line, we should be able to make use of basic algebra to invert the equation and obtain an expression for the regression coefficients. We do this by writing the equation in matrix form and then using the Moore-Penrose pseudoinverse to obtain the maximum likelihood estimator for the regression coefficients.

Caution: Moore-Penrose pseudoinverse

Prado snuck the Moore-Penrose pseudoinverse into this handout and then used it in the class and notes. I thought this wasn't the first time it had come up in the specialization, since we used MLE to some degree in all the courses, and in depth in the previous course on mixtures. However, a cursory review did not uncover any previous use of this mathematical tool, so I thought it might be useful to provide a brief overview of the Moore-Penrose pseudoinverse, its properties, and how it is used in regression models.

Moore-Penrose pseudoinverse is a generalization of the inverse matrix that can be applied to non-square matrices.

  • It is not covered in typical linear algebra courses; it might come up in a numerical methods course or a second course on linear algebra. But unless such a course covers applications to regression models, it is quite likely to skip an important point: the Moore-Penrose pseudoinverse is what yields the maximum likelihood estimator for the regression coefficients in linear regression models.

I found a couple of references in (Schott 2016 ch. 5), but they did not cover the MLE aspect of the pseudoinverse.

I cover some of this material in Section L.3 of the course notes; however, I am still missing a good source for Equation 90.1.

The pseudoinverse is denoted X^+. It is a neat bit of mathematics that allows us not only to invert the regression equation from the left, but to do so in a way that minimizes the sum of squared errors - i.e. we recover the parameters that minimize the residual sum of squares. One strong assumption we must make is that the design matrix X is full rank, which means that its columns are linearly independent. The columns of the design matrix correspond to the explanatory variables in the regression model.

In reality, as we are dealing with hierarchical models, we will not always have a full rank design matrix, but we can still use the Moore-Penrose pseudoinverse to obtain a maximum likelihood estimator for the regression coefficients.
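As a quick numerical sanity check (my own sketch, not part of the handout), for a full-rank design matrix the Moore-Penrose pseudoinverse computed by MASS::ginv coincides with (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}', so X^+ y reproduces the least squares coefficients:

# Sanity check: pseudoinverse vs. (X'X)^{-1} X' for a full-rank design matrix
library(MASS) # for ginv(), the Moore-Penrose pseudoinverse

set.seed(42)
n=100; k=3
X=cbind(1, matrix(rnorm(n*(k-1)), n, k-1)) # full-rank n x k design matrix
beta=c(1, 2, -0.5)
y=X %*% beta + rnorm(n)

beta_pinv=ginv(X) %*% y                    # X^+ y
beta_ols=solve(t(X) %*% X) %*% t(X) %*% y  # (X'X)^{-1} X'y
cbind(beta_pinv, beta_ols)                 # the two columns agree up to rounding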

90.1.1 Regression Models: Maximum Likelihood Estimation

Assume a regression model with the following structure: y_i = \beta_1x_{i,1} + \ldots + \beta_kx_{i,k} + \varepsilon_i,

for i = 1, \ldots, n and \varepsilon_i independent random variables with \varepsilon_i \sim \mathcal{N}(0, v) \quad \forall i. This model can be written in matrix form as:

y = \mathbf{X} \boldsymbol{\beta} + \boldsymbol\varepsilon \qquad \boldsymbol\varepsilon \sim \mathcal{N} (0, v\mathbf{I})

where:

  • y = (y_1, \ldots, y_n)′ is an n-dimensional vector of responses,
  • \mathbf{X} is an n × k matrix containing the explanatory variables,
  • \boldsymbol \beta = (\beta_1, \ldots, \beta_k)' is the k-dimensional vector of regression coefficients,
  • \boldsymbol \varepsilon = (\varepsilon_1, \ldots, \varepsilon_n)' is the n-dimensional vector of errors,
  • \mathbf{I} is an n \times n identity matrix.

If \mathbf{X} is a full rank matrix with rank k , the maximum likelihood estimator for \boldsymbol\beta, denoted as \hat{\boldsymbol\beta}_{MLE} is given by:

\hat{\boldsymbol{\beta}}_{MLE} = (\mathbf{X}'\mathbf{X})^{−1}\mathbf{X}'\mathbf{y}, \tag{90.1}

where X^+ = (\mathbf{X}'\mathbf{X})^{−1}\mathbf{X}' is the Moore-Penrose pseudoinverse of the matrix \mathbf{X} (when \mathbf{X} is full rank). The Moore-Penrose pseudoinverse is the most widely known generalization of the inverse matrix and is used to obtain the least squares solution to the linear regression problem.

and the MLE for v is given by:

\hat{v}_{MLE} = \frac{1}{n} (y − \mathbf{X} \hat{\boldsymbol{\beta}}_{MLE})′(y − \mathbf{X} \hat{\boldsymbol{\beta}}_{MLE}) \tag{90.2}

\hat{v}_{MLE} is not an unbiased estimator of v, therefore, the following unbiased estimator of v is typically used:

s^2 = \frac{1}{n-k}(y − \mathbf{X} \hat{\boldsymbol\beta}_{MLE} )′(y − \mathbf{X} \hat{\boldsymbol\beta}_{MLE} ) \tag{90.3}
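The estimators in Equation 90.1, Equation 90.2 and Equation 90.3 are easy to verify numerically. The following is a quick sketch of mine (not course code) that computes them on simulated data and cross-checks \hat{\boldsymbol\beta}_{MLE} against R's lm():

# MLE and variance estimators for a linear regression (Equations 90.1-90.3)
set.seed(1)
n=200; k=2
X=cbind(rnorm(n), rnorm(n))                 # n x k design matrix, full rank
beta_true=c(1.5, -0.7); v_true=2
y=drop(X %*% beta_true + rnorm(n, sd=sqrt(v_true)))

beta_MLE=solve(t(X) %*% X) %*% t(X) %*% y   # Equation 90.1
res=y - X %*% beta_MLE
v_MLE=sum(res^2)/n                          # Equation 90.2 (biased)
s2=sum(res^2)/(n - k)                       # Equation 90.3 (unbiased)

print(c(v_MLE=v_MLE, s2=s2))
print(cbind(beta_MLE, coef(lm(y ~ X - 1)))) # lm() reproduces beta_MLE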

90.1.2 Regression Models: Bayesian Inference

Assume once again that we have a model with the structure described at the start of Section 90.1.1, which results in a likelihood of the form

\mathbb{P}r(y \mid \boldsymbol{\beta} , v) = \frac{1}{(2\pi v)^{n/2}}\exp \left\{ -\frac{1}{2v} (y − \mathbf{X} \boldsymbol{\beta})′(y − \mathbf{X} \boldsymbol{\beta}) \right\}

If a prior of the form

\mathbb{P}r(\boldsymbol{\beta}, v) \propto \frac{1}{v} \tag{90.4}

is used, we obtain that the posterior distribution is given by:

\mathbb{P}r(\boldsymbol{\beta},v \mid \mathbf{y}) \propto \frac{1}{v^{n/2+1}}\exp \left\{ -\frac{1}{2v} (\mathbf{y} − \mathbf{X} \boldsymbol{\beta})′(\mathbf{y} − \mathbf{X} \boldsymbol{\beta}) \right\} \tag{90.5}

In addition it can be shown that

  • (\boldsymbol{\beta}\mid v, \mathbf{y}) \sim \mathcal{N} (\hat{\boldsymbol{\beta}}_{MLE} , v(\mathbf{X}'\mathbf{X})^{-1})
  • (v \mid \mathbf{y}) \sim \mathcal{IG}\left(\frac{(n − k)}{2}, \frac{1}{2}d\right) with

d = (\mathbf{y} − \mathbf{X} \hat{\boldsymbol{\beta}}_{MLE} )′(\mathbf{y} − \mathbf{X} \hat{\boldsymbol{\beta}}_{MLE} ) \tag{90.6}

where \mathcal{IG}(a, b) denotes the inverse-gamma distribution with shape parameter a and scale parameter b.

with k = dim(\boldsymbol\beta).

Given that \mathbb{P}r(\boldsymbol\beta, v \mid \mathbf{y}) = \mathbb{P}r(\boldsymbol\beta \mid v, \mathbf{y})\,\mathbb{P}r(v \mid \mathbf{y}), the equations above provide a way to sample directly from the posterior distribution of \boldsymbol \beta and v: first sample v from the inverse-gamma distribution above and then, conditioning on the sampled value of v, sample \boldsymbol \beta from the normal distribution above.
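Below is a minimal sketch of this direct sampling scheme (my own code, reusing the simulated regression data from the earlier sketch): it draws v from the inverse-gamma marginal and then \boldsymbol\beta from the conditional normal.

# Direct sampling from p(beta, v | y) under the reference prior
library(MASS) # for mvrnorm()

set.seed(2)
n=200; k=2
X=cbind(rnorm(n), rnorm(n))
y=drop(X %*% c(1.5, -0.7) + rnorm(n, sd=sqrt(2)))

XtX_inv=solve(t(X) %*% X)
beta_MLE=drop(XtX_inv %*% t(X) %*% y)
d=sum((y - X %*% beta_MLE)^2)                    # Equation 90.6

M=5000
v_post=1/rgamma(M, shape=(n-k)/2, rate=d/2)      # step 1: v | y ~ IG((n-k)/2, d/2)
beta_post=t(sapply(v_post, function(v)           # step 2: beta | v, y ~ N(beta_MLE, v (X'X)^{-1})
  mvrnorm(1, mu=beta_MLE, Sigma=v*XtX_inv)))

print(colMeans(beta_post)); print(mean(v_post))  # posterior means near beta_MLE and s^2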

90.2 Maximum likelihood estimation in the AR(1) 🎥

Figure 90.1: MLE 1
Figure 90.2: Full Likelihood MLE
Figure 90.3: Conditional Likelihood MLE

There are two main strategies for performing MLE for an AR(1) model:

  • Full Likelihood: Considers the joint distribution of all observations y_1, \dots, y_T.
  • Conditional Likelihood: Conditions on the first observation (y_1) and works with the likelihood of the remaining observations (y_2, \dots, y_T).

90.2.1 Model Setup

The focus is on the zero-mean AR(1) model:

Y_t = \phi Y_{t-1} + \varepsilon_t, \quad \varepsilon_t \overset{iid}{\sim} \mathcal{N}(0, v), \quad \phi \in (-1,1)

The condition \phi \in (-1,1) ensures stationarity of the process.

90.2.2 Distributional Assumptions

  • Y_1 \sim \mathcal{N}\left(0, \frac{v}{1 - \phi^2}\right)
  • Y_t \mid Y_{t-1} \sim \mathcal{N}(\phi Y_{t-1}, v) for t \geq 2

90.2.3 Likelihood Approaches

Two approaches are considered:

90.2.3.1 1. Full Likelihood

\begin{aligned} p(y_{1:T} \mid \phi, v) &= p(y_1 \mid \phi, v) \cdot \prod_{t=2}^T p(y_t \mid y_{t-1}, \phi, v) \\ &= \frac{1}{\sqrt{2\pi \frac{v}{1 - \phi^2}}} \exp\left( -\frac{y_1^2 (1 - \phi^2)}{2v} \right) \cdot \prod_{t=2}^T \frac{1}{\sqrt{2\pi v}} \exp\left( -\frac{(y_t - \phi y_{t-1})^2}{2v} \right) \\ &= \frac{(1 - \phi^2)^{1/2}}{(2\pi v)^{T/2}} \cdot \exp\left( -\frac{1}{2v} \left[ \underbrace{ y_1^2(1 - \phi^2) + \sum_{t=2}^T (y_t - \phi y_{t-1})^2 }_{\text{Quadratic Loss } Q^*(\phi)} \right] \right) \\ &= \frac{(1 - \phi^2)^{1/2}}{(2\pi v)^{T/2}} \exp\left( -\frac{Q^*(\phi)}{2v} \right) \end{aligned} \tag{90.7}

where Q^*(\phi) is defined as:

Q^*(\phi) = \underbrace{y_1^2(1 - \phi^2)\vphantom{\sum_{t=2}^T (y_t - \phi y_{t-1})^2}}_{\text{Initial Loss}} + \underbrace{\sum_{t=2}^T (y_t - \phi y_{t-1})^2}_{\text{Remaining Loss } Q(\phi)} \tag{90.8}

Maximizing the full likelihood jointly as a function of \phi and v does not lead to closed-form expressions, so numerical optimization is required.

90.2.3.2 2. Conditional Likelihood

Conditioning on the first observation y_1, the likelihood of the remaining observations is

\begin{aligned} p(y_{2:T} \mid y_1, \phi, v) &= \prod_{t=2}^T \frac{1}{\sqrt{2\pi v}} \exp\left( -\frac{(y_t - \phi y_{t-1})^2}{2v} \right) \\ &= \frac{1}{(2\pi v)^{(T-1)/2}} \exp\left( -\frac{1}{2v} \sum_{t=2}^T \left( y_t - \phi y_{t-1} \right)^2 \right) \\ &= \frac{1}{(2\pi v)^{(T-1)/2}} \exp\left( -\frac{Q(\phi)}{2v} \right) \end{aligned} \tag{90.9}

where Q(\phi) = \sum_{t=2}^T (y_t - \phi y_{t-1})^2 is the conditional sum of squares (the \text{Remaining Loss} term in Equation 90.8). Viewed as a function of \phi, Q(\phi) is quadratic, so the conditional likelihood can be maximized in closed form. To see this, write the model as a linear regression:

\underbrace{ \begin{pmatrix}y_2 \\ y_3 \\ \vdots \\ y_T \end{pmatrix} }_{\utilde{y}} = \underbrace{ \begin{pmatrix}y_1 \\ y_2 \\ \vdots \\ y_{T-1} \end{pmatrix} }_{\mathbb{X}} \underbrace{ \phi \vphantom{\begin{pmatrix}y_1 \\ y_2 \\ \vdots \\ y_{T-1} \end{pmatrix} }}_{\beta} + \underbrace{ \begin{pmatrix}\varepsilon_2 \\ \varepsilon_3 \\ \vdots \\ \varepsilon_T\end{pmatrix}}_{\utilde {\varepsilon}} \tag{90.10}

or, in compact form,

\utilde{y} = \mathbb{X} \beta + \utilde{\varepsilon} \qquad \utilde{\varepsilon} \sim \mathcal{N}(0, v\mathbf{I})

where \utilde{y} = (y_2, \ldots, y_T)' is the response vector, the design matrix \mathbb{X} is the single column (y_1, \ldots, y_{T-1})', \beta = \phi is the AR coefficient, and \utilde{\varepsilon} = (\varepsilon_2, \ldots, \varepsilon_T)' \sim \mathcal{N}(0, v\mathbf{I}) is the error vector.

If the matrix \mathbb{X} is full rank, the standard regression results apply. The MLE for \beta = \phi is

\hat{\beta} = (\mathbb{X}'\mathbb{X})^{-1}\mathbb{X}'\utilde{y},

which in this case reduces to

\hat{\phi}_{MLE} = \frac{\sum_{t=2}^T y_t y_{t-1}}{\sum_{t=2}^T y_{t-1}^2} \tag{90.11}

The unbiased estimator of v is

S^2 = \frac{(\utilde{y} - \mathbb{X}\hat{\beta})'(\utilde{y} - \mathbb{X}\hat{\beta})}{\dim(\utilde{y})-\dim(\beta)} = \frac{\sum_{t=2}^T (y_t - \hat{\phi}_{MLE}\, y_{t-1})^2}{T - 2} \tag{90.12}

since \utilde{y} has T - 1 components and \beta is a scalar. The MLE of v divides the same sum of squares by T - 1 rather than T - 2 and is biased, so in practice the unbiased estimator S^2 is usually preferred.

90.2.3.3 Example: Full Likelihood with Known Variance

Unlike the conditional likelihood, maximizing the full likelihood requires numerical optimization methods (e.g., Newton-Raphson), as there is no closed-form solution. To illustrate, assume the variance is known, v = 1, so that \phi is the only parameter. The two pieces of the full likelihood are

\begin{aligned} y_1 \mid \phi &\sim \mathcal{N}(0, 1/(1 - \phi^2)) \\ y_t \mid y_{t-1}, \phi &\sim \mathcal{N}(\phi y_{t-1}, 1), \quad t = 2, \ldots, T, \end{aligned}

so that

\begin{aligned} p(y_{1:T} \mid \phi) &= p(y_1 \mid \phi) \cdot \prod_{t=2}^T p(y_t \mid y_{t-1}, \phi)\\ &= \frac{(1 - \phi^2)^{1/2}}{(2\pi)^{1/2}} \exp\left(-\frac{y_1^2(1 - \phi^2)}{2}\right) \cdot \prod_{t=2}^T \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{(y_t - \phi y_{t-1})^2}{2}\right) \\ &= \frac{(1 - \phi^2)^{1/2}}{(2\pi)^{T/2}} \exp\left(-\frac{1}{2} \left[y_1^2(1 - \phi^2) + \sum_{t=2}^T (y_t - \phi y_{t-1})^2\right]\right). \end{aligned}

The term in square brackets is Q^*(\phi) from Equation 90.8, so the log-likelihood becomes:

\log p(y_{1:T} \mid \phi) = \frac{1}{2} \log(1 - \phi^2) - \frac{1}{2} Q^*(\phi) + K \tag{90.13}

where K is a constant that does not depend on \phi. Setting the first derivative with respect to \phi to zero does not yield a closed-form expression for \hat{\phi}_{MLE}; instead, we need a numerical optimization method such as Newton-Raphson (or a one-dimensional optimizer, as in the R code below) to obtain the maximum likelihood estimator for \phi.

90.2.4 Conclusion

  • Full likelihood is more general but computationally intensive.
  • Conditional likelihood simplifies estimation by leveraging regression theory.
  • When variance v is known (e.g., v = 1), the optimization reduces to maximizing a univariate function of \phi.
  • For full likelihood, optimization of Q^*(\phi) is required.
| Feature | Full Likelihood | Conditional Likelihood (Regression) |
|---|---|---|
| Accounts for y_1 | ✅ Yes | ❌ No |
| MLE for \phi | ❌ No closed form | ✅ Closed form |
| MLE for v | ❌ Biased unless adjusted | ✅ Unbiased estimator available |
| Optimization needed | ✅ Yes (numerical methods) | ❌ No (closed-form MLE for \phi) |
| Useful when | Modeling the full joint process | Estimating \phi efficiently in practice |

I will now talk about how to perform maximum likelihood estimation in the case of the autoregressive model of order one. Recall that we have the equation for the process. This is a zero mean AR1, looks like this. And then the epsilon t’s are iid normal (0,v). So the parameters of the model here, our phi, the AR coefficient, and the variance associated to the epsilon t’s. We have two options to perform maximum likelihood estimation in this case. One case is to consider the full likelihood, and the other case I will be describing is the case in which we work only with a conditional likelihood. And in the case of the AR1, we can condition on the first observation, and then work only with that conditional likelihood. I will begin working with the full likelihood here. If you recall, we are also going to assume that the process is such that the phi is in the (-1,1) interval, which implies the process is going to be stationary. And in that case, we said that the yt’s have a particular distribution. They are all going to be normally distributed. In particular, the first y, y1, is going to be normally distributed. It’s going to be zero mean. And the variance is that term that we described before, is that v / (1- phi squared). For the rest of the yt’s, because of the temporal dependency, I have to write down the distribution of yt given y_{t-1}. So for the first y1, I have this distribution, and for the remaining ones, where yt given y_{t-1}, I have a normal. Now the mean is phi, y_{t-1}, and the variance is v. So combining these two pieces, if I now think about having observations y1, all the way to y capital T. I can write down the likelihood function as a function of v and the phi coefficient as follows. So I can write down, 1 to T given phi and v as the product of the first term, which is this first density. This one depends on phi and v. And then I have the product from t = 2 up to capital T. Phi and v, so this piece gives me a normal. So I’m going to have that first piece is going to be 1 / (2pi v) one-half. And then I have my 1- phi squared, To the power of one-half. And then I have my exponential piece. Over 2v, so that’s the first density, y1 given phi and v. And then I have this product of all these normal densities here, yt given y_{t-1}. So in this case, when I consider this product, we’re going to have, T-1 terms, so that’s to the power of (T-1)/2. And then I have this exponential. And t equals from 2 up to capital T. And this is divided by 2v. So this is the expression that I obtain. So I can put some terms together here and just write this down, simplify this expression a bit. And then this is all, Divided by 1/2v. So this gives me the expression for the full likelihood. If you look at what is in here, this term, I can call this term Q star of phi. So here, Q star, of phi is that y1 squared(1- phi squared). And so this expression, Can be written like this. So if I look at this expression now, if I want to perform maximum likelihood estimation, I would have to maximize this function viewed as a function of phi and v. So I have this entire expression here. The Q star is given by this expression. So it has two components, one component that has to do with the first y. And then the remaining components all go from this t goes from 2 to capital T. So if you think about the first term, the term for t = 2, you have the terms y2 and y1 appearing here. So this piece, we can see later, is the one that we will use when we are doing maximum likelihood estimation using the conditional likelihood. 
And I will call this expression later Q(phi). But it has these two terms, the Term that corresponds to this component for the first y and then the remaining terms. We will study how to perform maximum likelihood estimation in these two cases. This gives me the full likelihood. We can also work with the so called conditional likelihood, that conditions on the first observation. So, in here I can write down my likelihood, now it is going to go from, 2 up to T. I’m conditioning on the first y1. And then I view this again as a function of phi and v. And the y1 is given. So, in this case, My likelihood is going to have. All the terms from 2 up to T. And then I’m going to use the fact that they are all normally distributed conditional on the previous value. So I’m just going to have here. So, we can write down all these terms. We’re going to have T -1 of them, and then I’m going to have that exponential of the sum of the yt minus phi minus 1 squared over 2v, which is what we have called before the summation here. That goes from t equals 2 up to T is what we have called Q of phi. So, rewriting this expression. And the nice thing about working with the conditional likelihood, is that if you view this as a function of v and phi. What appears in here, is really going to be a quadratic function of the phi parameter. So it’s much easier to maximize that likelihood function because all the results of the maximization are going to appear in close form when we work with the conditional likelihood. Therefore it’s easy if we work with the full likelihood, we need to use a numerical maximization procedure. So just to show you how to obtain maximum likelihood estimation in this particular framework using the conditional likelihood, is useful to see a correspondence between this likelihood and a linear regression model. So, I can write down for the case of the conditional likelihood again, something that looks like this. So, I can stack all my ys from 2 up to T in a single vector. I can call that vector y, then I can put my phi coefficient in here. And then just thinking about what is my dependency on the past. y2 is going to be dependent on y1, y3 will be dependent on y2 and so on. And then I’m going to have here, all the way down to y_{T -1}. And then I have my epsilon2 all the way to epsilont. And I’m going to call this an x, in this case is a vector, but in the general regression it can be a matrix. And then this here would be my beta, again in the general regression framework this is usually a vector. And here I’m going to call this my epsilon vectors. So, if you go and look at the results in terms of regression model, we have a model that looks like this, And then the epsilon, Follows here a normal zero v times an identity matrix. So when we work with this expression we know that in order to obtain maximum likelihood estimation, assuming that the matrix X is a full rank matrix. We obtain maximum likelihood estimation as beta hat, X transpose X, inverse X transpose y. And we can also obtain an estimate for v. That I’m going to call s square, is my estimate for v here. We can simply obtain this as y, in general X beta hat transpose y X beta hat. And then here is going to to be divided by the dimension off the y vector. And by that, I mean how many components I have in that vector minus the dimension of the beta vector. So we can apply these equations to the case of here y1, to obtain maximum likelihood estimators for phi and v. 
In the case of the maximum likelihood estimator for phi, I want to call it phi hat and v. We simply take, if we look at this formula, we’re going to take this transpose. So it’s a column vector that becomes a row vector, multiplied by itself. So I’m going to have the summation of all these square terms, and then I have to take the inverse of that. And then the next piece which is my X transpose times y is going to be the product of this vector times this vector. So this results in having my maximum likelihood estimator given as, from t2 up to T of y_{t- 1} divided by this sum of the y_{t-1} square, for t2 up to T. That’s my maximum likelihood estimator for phi. And then I can obtain my s square in this case is also just simply taking the sum, when I do this product, I get the some of the yt minus phi hat MLE times y_{t -1} squared. And this goes from 2 of the T, divided by the dimension of this y vector, which is going to be T -1 and the dimension of the beta here, which in our case is a scalar, so it’s 1. So I’m going to curve here. T - 2. This gives me the two estimates for the parameters that we can use for maximum likelihood in the case of phi hat. We can also obtain the maximum likelihood estimator for v. But we usually work with this unbiased estimator. In the case of the full likelihood is not possible to establish a correspondence between the equations that we have for the likelihood of the autoregressive process and the linear regression model. So one has to do numerical optimization or numerical maximization if one wants to get the maximum likelihood estimators for the AR coefficient in the case of the AR(1) and also for the variance. Let’s just look at one example. So again recall that we’re working with the AR(1). And just to simplify things here a little bit I’m going to assume that we only want to perform maximum likelihood estimation for phi and that we know the variance of the process. And let’s assume that that variance is, For the epsilon ts, they are all normal 0,1. So there is no v here, only the phi. So if you recall when we work with a full likelihood, there are two pieces. The first piece is the distribution of y1 given phi which in this case is normal 0. And then we have that 1 over 1 minus phi square. And here again I’m assuming that phi is in this interval (-1,1). And then we have the piece that is all the remaining ts conditional on y. So the yt giving y_{t-1} for t going from two all the way to capital T. And then we have this is And then 1. So when we put together the likelihood, the full likelihood for this case, the only parameter is phi. So we’re going to have that product of p of y1 given phi times the remaining terms. And so again if we think about the expressions we have the 1 minus phi square 1 half. And then we have 1 minus phi square over two. And then we have, The remaining components. So we have T minus 1 of the 2 pi times the variance, the variance is 1. And then I have the summation. This will go from 2 all the way to capital T, Over 2. So if I were to put things together again I would have this expression. And then I have the exponential. And if I were to maximize this expression again I cannot use the results from linear regression to do this. I have to use numerical methods like Newton Raphson to maximize this function. I can consider the log likelihood. So if I take the log of this, You can see that we will have a term here. That is one half log of this. 
And then I have the log of this piece here which just gives me, And this expression if you recall we had called this Q star phi. So plus some constant, this piece is just a constant, I’m going to call it K. So if I were to look at maximizing this function, I can think about taking first derivatives with respect to phi. And then you will see that again the expression that you obtain doesn’t allow you to obtain a close form expression for the maximum likelihood estimator of phi. Instead, what you will do is you can use a numerical optimization method such as Newton Raphson to obtain the maximum likelihood estimator for phi. [MUSIC]

90.3 MLE for the AR(1) 🗒️ \mathcal{R}

The following code allows you to compute the MLE of the AR coefficient \phi, the unbiased estimator s^2 of v, and the MLE of v, based on a dataset simulated from an AR(1) process and using the conditional likelihood.

set.seed(2021)
phi=0.9 # ar coefficient
v=1
sd=sqrt(v) # innovation standard deviation
T=500 # number of time points
yt=arima.sim(n = T, model = list(ar = phi), sd = sd) 

## Case 1: Conditional likelihood
y=as.matrix(yt[2:T]) # response variable
X=as.matrix(yt[1:(T-1)]) # design matrix
phi_MLE=as.numeric((t(X)%*%y)/sum(X^2)) # MLE for phi
s2=sum((y - phi_MLE*X)^2)/(length(y) - 1) # unbiased estimate for v
v_MLE=s2*(length(y)-1)/(length(y)) # MLE for v

cat("\n MLE of conditional likelihood for phi: ", phi_MLE, "\n",
    "MLE for the variance v: ", v_MLE, "\n", 
    "Estimate s2 for the variance v: ", s2, "\n")

 MLE of conditional likelihood for phi:  0.9261423 
 MLE for the variance v:  1.048 
 Estimate s2 for the variance v:  1.050104 

This code allows you to compute estimates of the AR(1) coefficient and the variance using the arima function in R. The first case uses the conditional sum of squares, the second and third cases use the full likelihood with different starting points for the numerical optimization required to compute the MLE with the full likelihood.

# Obtaining parameter estimates using the arima function in R
set.seed(2021)
phi=0.9 # ar coefficient
v=1
sd=sqrt(v) # innovation standard deviation
T=500 # number of time points
yt=arima.sim(n = T, model = list(ar = phi), sd = sd) 

# Using conditional sum of squares, equivalent to conditional likelihood 
arima_CSS=arima(yt,order=c(1,0,0),method="CSS",n.cond=1,include.mean=FALSE)
cat("AR estimates with conditional sum of squares (CSS) for phi and v:", arima_CSS$coef,arima_CSS$sigma2,
"\n")
AR estimates with conditional sum of squares (CSS) for phi and v: 0.9261423 1.048 
#Uses ML with full likelihood 
arima_ML=arima(yt,order=c(1,0,0),method="ML",include.mean=FALSE)
cat("AR estimates with full likelihood for phi and v:", arima_ML$coef,arima_ML$sigma2,
"\n")
AR estimates with full likelihood for phi and v: 0.9265251 1.048434 
#Default: uses conditional sum of squares to find the starting point for ML and 
#         then uses ML 
arima_CSS_ML=arima(yt,order=c(1,0,0),method="CSS-ML",n.cond=1,include.mean=FALSE)
cat("AR estimates with CSS to find starting point for ML for phi and v:", 
arima_CSS_ML$coef,arima_CSS_ML$sigma2,"\n")
AR estimates with CSS to find starting point for ML for phi and v: 0.9265252 1.048434 

This code shows you how to compute the MLE for \phi using the full likelihood and the function optimize in R.

set.seed(2021)
phi=0.9 # AR coefficient
v=1
sd=sqrt(v) # innovation standard deviation
T=500 # number of time points
yt=arima.sim(n = T, model = list(ar = phi), sd = sd) 

## MLE, full likelihood AR(1) with v=1 assumed known 

# log likelihood function (up to an additive constant)
log_p <- function(phi, yt){
  0.5*(log(1-phi^2) - sum((yt[2:T] - phi*yt[1:(T-1)])^2) - yt[1]^2*(1-phi^2))
}

# using a built-in optimization method to obtain the MLE
result = optimize(log_p, c(-1, 1), 
                  tol = 0.0001, 
                  maximum = TRUE, 
                  yt = yt)
cat("\n MLE of full likelihood for phi: ", result$maximum)

 MLE of full likelihood for phi:  0.9265928

90.4 Bayesian inference in the AR(1) 🎥

Figure 90.4: inference

y_t= \phi y_{t-1} + \varepsilon_t, \quad \varepsilon_t \overset{iid}{\sim} \mathcal{N}(0, v), \quad \phi \in (-1,1) \tag{90.14}

\utilde{y} = \mathbb{X}\utilde{\beta} + \utilde{\varepsilon} \tag{90.15}

where the design matrix \mathbb{X} is the single column (y_1, \ldots, y_{T-1})', \utilde{\beta} = \phi is the AR coefficient, and \utilde{\varepsilon} \sim \mathcal{N}(0, v\mathbf{I}) is the error term. Explicitly,

\utilde{y} = \begin{pmatrix} y_2 \\ y_3 \\ \vdots \\ y_{T} \end{pmatrix} \tag{90.16}

\mathbb{X} = \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_{T-1} \end{pmatrix} \tag{90.17}

\utilde{\beta} = \phi \tag{90.18}

\utilde{\varepsilon} \sim \mathcal{N}(0, v\mathbf{I}) \tag{90.19}

p(y_{2:T} \mid y_1, \phi, v) = \frac{1}{(2\pi v)^{\frac{T-1}{2}}} \exp\left(-\frac{(\utilde{y} - \mathbb{X}\utilde{\beta})'(\utilde{y} - \mathbb{X}\utilde{\beta})}{2v}\right) \tag{90.20}

p(\phi, v \mid y_{1:T}) \propto p(\phi, v) p(y_{2:T} \mid y_1, \phi, v) \tag{90.21}

Under the reference prior p(\phi, v) \propto \frac{1}{v}, the posterior factors into two closed-form pieces:

\begin{aligned} (\utilde{\beta} \mid v, \mathbb{X}, \utilde{y}) &\sim \mathcal{N}(\hat{\utilde{\beta}}_{MLE}, v(\mathbb{X}'\mathbb{X})^{-1}) \\ (v \mid \mathbb{X}, \utilde{y}) &\sim \mathcal{IG}\left(\frac{T-2}{2}, \frac{1}{2}Q(\hat{\phi}_{MLE}) \right) \end{aligned} \tag{90.22}

\hat{\utilde{\beta}}_{MLE} = (\mathbb{X}'\mathbb{X})^{-1}\mathbb{X}'\utilde{y} = \hat{\phi}_{MLE} = \frac{\sum_{t=2}^T y_t y_{t-1}}{\sum_{t=2}^T y_{t-1}^2} \tag{90.23}

Q(\hat{\phi}_{MLE}) = \sum_{t=2}^T (y_t - \hat{\phi}_{MLE}\ y_{t-1})^2 \tag{90.24}

We will now discuss Bayesian inference in the case of the autoregressive model of order one. As you know, in the case of the autoregressive model, you have two options for the likelihood function. I will discuss a model in which you use a likelihood function that is based on the conditional information, so the conditional likelihood. Then you have to pick a prior distribution for the parameters of the model. In this part, we will discuss Bayesian inference with the conditional likelihood and the reference prior making a connection with the regression model. Recalling the AR1 case, we have a model that looks like this, and then the Epsilons we’re assuming they are iid, normally distributed with zero mean and variance V. We have two parameters here, Phi, the AR coefficient, and v the variance. When we work with a conditional likelihood, we discussed we can write this model as a linear regression model, so in this case, what that means is we’re writing the model as y equals x Beta plus some other vector Epsilon. In the case of the AR1 you can write with the conditional likelihood you’re conditioning on the first observation, so your y is the vector that goes from y_2 up to y_T, the total number of observations you have. X in this case is just a vector as well with the past values, and then Beta is just a scalar, in this case you have Phi, and the Epsilon is normally distributed. It’s a multivariate normal in this case. The mean is going to be the vector of zeros here, and the covariance matrix is going to be v, the variance of the process, times the identity. In this case, the identity matrix is going to be of the same dimension that corresponds to the y, so it’s going to be T minus 1 times T minus 1 Identity. Using this form, we can write first the conditional likelihood in terms of this normal distribution. If I think about writing the likelihood here, it’s y_2 up to capital T given Phi and v and y_1. I can write this down in terms of density that looks like this, so I have my one over two Pi v here. Again, this is the v, my y vector here and my Beta here are going to be given by this expression, so this is going to be a function of phi as well, and then I have to pick a prior distribution for the model parameters. We can work with different kinds of priors; we can work with conjugate priors and we can work with non conjugate priors in the case of this likelihood, or the other likelihood. What we’re going to do here is we’re going to use the reference prior for several reasons, one of the most important reasons we’re working with that prior is that you get closed form posterior distributions for the model parameters, and you can do posterior inference in a computationally easy way. The reference prior, if I think about my parameters here, we are going to use something that looks like this. This is my V. Then we’re going to combine. To get the posterior distribution, we just combine the likelihood is proportional to the prior times the likelihood. In this case, again, because we are working with the conditional likelihood, we’re conditioning on the first observation. Combining this piece with this piece, we can get the posterior inference and the results from regression models is that essentially when you have something of this form and you use your reference prior, you’re going to get that the posterior distribution, so there are two important results here. One is that the conditional posterior for beta given v, X and y is normally distributed here. 
Centered at beta hat, the maximum likelihood estimator. Then with a variance covariance matrix that has the form v, X transpose X inverse. That’s the form of the conditional for beta given v and all the observations. Then the other result that we have is the marginal for v is when we integrate out the beta is going to have a form with this particular prior that leads to an inverse gamma. The inverse gamma prior. I’m going to use this notation to say that this is distributed as an inverse gamma. The parameters of the inverse gamma in the case of the AR(1), with the conditional likelihood, is going to have T minus 2 over 2 here. Then it’s going to be that Q that we defined before, evaluated at this beta MLE over 2. What is the maximum likelihood estimator for beta here? which is, as we say, this phi. Well, the estimator, as we saw last time, you can get it by computing, assuming that the X matrix is full rank. You’re going to get it just using these expressions. You just do X transpose X inverse, X transpose y. As we saw before, in this case, those operations are easy. X transpose X is just multiplying this vector transpose by itself. Then just inverting that. Then we just multiply this one times this. For our beta hat here, which is exactly my phi, my parameter estimate, we get the maximum likelihood estimator, which in the case of the AR is y_t y_{t-1}, t goes from two to capital T over, so this is the expression here. Then for this v, the Q expression that we wrote before, if you recall, we can view this as a quadratic expression. In this case again, the Beta hat MLE is just, the phi hat MLE. That’s going to be just the sum from two to capital T y_t. We now have all the elements to obtain posterior samples from the posterior distribution of phi and v as follows. Since the v is distributed as an inverse Gamma, we can sample from this inverse Gamma distribution. We know what the parameters of the inverse gamma are and then conditional on a value that we sample that v. We can then sample for the distribution of beta, which in this case is the conditional distribution of phi given v, X, and y. We obtain samples for v and phi.

90.5 Bayesian inference in the AR(1): Conditional likelihood example 🎥

This video walks through the code snippet below and provides examples of how to sample from the posterior distribution of the AR coefficient \phi and the variance v using the conditional likelihood and a reference prior.

In this code, we discuss Bayesian inference for the case of the AR(1) process assuming a conditional likelihood and a reference prior. I’m going to go ahead and just generate some data. Here I’m sampling 200 observations from an AR(1) with AR coefficient 0.9, and standard deviation one or variance one for the observational variance. Here, just recall how we compute the maximum likelihood estimator for the AR(1) using the conditional likelihood and making the correspondence to the regression setting. Here I’m setting up my y response vector, my X matrix, and then I just run the equations to obtain the two estimates. The maximum likelihood estimate for the AR coefficient in this case, and then I also get an estimate for the observation and variance that is unbiased. Now, if we want to proceed with Bayesian inference using the conditional likelihood and the reference prior, we obtain closed form inference in this case. We can use direct sampling by first sampling the variance parameter and then conditional on the variance, we can sample the AR coefficients. Here I’m just going to set up the sample to be 3000. I’m going to obtain 3000 samples from the posterior distribution. The first step is to sample from the posterior of v, which is an inverse Gamma distribution with the corresponding parameters. Recall here we have a degrees of freedom parameter, and then we have the other parameter for the inverse gamma. I’m sampling these 3000 samples and now conditional on each of those, I’m going to sample from a normal distribution. This piece of code is just doing that. Just sampling from the normal for each of the variances, for each of the samples of the variances here. I can plot now a histogram of the posterior samples for phi. This is my histogram for the posterior sample for phi. If you remember, the process was originally sampled from an AR with coefficient 0.9, so this is my posterior. Here I can also show a histogram of the posterior for the variance. Again recall, in this case, we were sampling from an AR process with observation and variance one. This is our posterior inference based on the reference prior and the conditional likelihood.

90.6 AR(1) Bayesian inference, conditional likelihood example 🗒️ \mathcal{R}

####################################################
#####             MLE for AR(1)               ######
####################################################
set.seed(2021)
phi=0.9 # ar coefficient
sd=1 # innovation standard deviation
T=200 # number of time points
yt=arima.sim(n = T, model = list(ar = phi), sd = sd) # sample stationary AR(1) process

y=as.matrix(yt[2:T]) # response
X=as.matrix(yt[1:(T-1)]) # design matrix
phi_MLE=as.numeric((t(X)%*%y)/sum(X^2)) # MLE for phi
s2=sum((y - phi_MLE*X)^2)/(length(y) - 1) # Unbiased estimate for v
v_MLE=s2*(length(y)-1)/(length(y)) # MLE for v 

print(c(phi_MLE,s2))
[1] 0.9178472 1.0491054
#######################################################
######     Posterior inference, AR(1)               ###
######     Conditional Likelihood + Reference Prior ###
######     Direct sampling                          ###
#######################################################

n_sample=3000   # posterior sample size

## step 1: sample posterior distribution of v from inverse gamma distribution
v_sample=1/rgamma(n_sample, (T-2)/2, sum((yt[2:T] - phi_MLE*yt[1:(T-1)])^2)/2)

## step 2: sample posterior distribution of phi from normal distribution
phi_sample=rep(0,n_sample)
for (i in 1:n_sample){
phi_sample[i]=rnorm(1, mean = phi_MLE, sd=sqrt(v_sample[i]/sum(yt[1:(T-1)]^2)))}

## plot histogram of posterior samples of phi and v
par(mfrow = c(1, 2),mar = c(3, 4, 2, 1), cex.lab = 1.3)
hist(phi_sample, xlab = bquote(phi), 
     main = bquote("Posterior for "~phi),xlim=c(0.75,1.05), col='lightblue')
abline(v = phi, col = 'red')
hist(v_sample, xlab = bquote(v), col='lightblue', main = bquote("Posterior for "~v))
abline(v = sd, col = 'red')

Note: Overthinking Bayesian inference in the AR(1)

🤔 Just pondering: why is the red line (which marks the true simulation values \phi = 0.9 and v = 1) off-center in this chart, i.e. not equal to the posterior means for \phi and v?

We will see a far greater discrepancy between the two in the coding assignment, where the posterior mean is very different from the MLE.

One idea is that this is due to the posterior mean being influenced by the prior distribution, while the MLE is solely based on the data. In this case, the prior is a reference prior, which does not have a strong influence on the posterior distribution, yet has a marked effect on the posterior mean.

90.6.1 Bayesian inference in the AR(1): full likelihood example 🗒️

We consider a prior distribution that assumes that \phi and v are independent:

\mathbb{P}r(v) \propto \frac{1}{v},

\mathbb{P}r(\phi) = \frac{1}{2}, \quad \text{for } \phi \in (-1, 1),

i.e., we assume a Uniform prior for \phi \in (-1, 1). Combining this prior with the full likelihood in the AR(1) case, we obtain the following posterior density:

\mathbb{P}r(\phi, v \mid y_{1:T}) \propto \frac{(1 - \phi^2)^{1/2} }{v^{T/2 + 1}} \exp\left(-\frac{Q^*(\phi)}{2v}\right), \quad -1 < \phi < 1,

with

Q^*(\phi) = y_1^2(1 - \phi^2) + \sum_{t=2}^{T} (y_t - \phi y_{t-1})^2.

It is not possible to get a closed-form expression for this posterior or to perform direct simulation. Therefore, we use simulation-based Markov Chain Monte Carlo (MCMC) methods to obtain samples from the posterior distribution.

90.6.2 Transformation of \phi

We first consider the following transformation on \phi:

\eta = \log\left(\frac{1 - \phi}{\phi + 1}\right),

so that \eta \in (-\infty, \infty). The inverse transformation on \eta is:

\phi = \frac{1 - \exp(\eta)}{1 + \exp(\eta)}.

Writing down the posterior density for \eta and v, we obtain

\mathbb{P}r(\eta, v \mid y_{1:T}) \propto\frac{ (1 - \phi^2)^{1/2} }{v^{T/2 + 1}} \exp\left(-\frac{Q^*(\phi)}{2v}\right) \cdot \frac{2 \exp(\eta)}{(1 + \exp(\eta))^2},

with \phi written as a function of \eta. We proceed to obtain samples from this posterior distribution using the MCMC algorithm outlined below. Once we have obtained M samples from \eta and v after convergence, we can use the inverse transformation above to obtain posterior samples for \phi.

90.6.3 MCMC Algorithm: Bayesian Inference for AR(1), Full Likelihood

Algorithm:

  1. Initialize \eta^{(0)} (equivalently, \phi^{(0)}).
  2. For m in 1:M do:
    • Sample v^{(m)} \sim \text{IG}\left(\frac{T}{2}, \frac{Q^*(\phi^{(m-1)})}{2}\right).
    • Sample \eta^{(m)} using Metropolis-Hastings:
      1. Sample \eta^* \sim N(\eta^{(m-1)}, c), where c is a tuning parameter.
      2. Compute the acceptance ratio:

r = \frac{p(\eta^*, v^{(m)} \mid y_{1:T})}{p(\eta^{(m-1)}, v^{(m)} \mid y_{1:T})}.

      3. Set:

\eta^{(m)} = \begin{cases} \eta^* & \text{with probability } \min(r, 1), \\ \eta^{(m-1)} & \text{otherwise}. \end{cases}
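Below is a minimal R sketch of this algorithm (my own implementation, not course code). It simulates a series, treats c as a hand-picked proposal variance, and discards an arbitrary burn-in; the Metropolis-Hastings ratio is computed on the log scale using the log posterior of (\eta, v), which includes the Jacobian term 2\exp(\eta)/(1+\exp(\eta))^2.

# MCMC for the AR(1) with the full likelihood (sketch)
set.seed(2021)
T=200; phi_true=0.9
yt=arima.sim(n = T, model = list(ar = phi_true), sd = 1) # simulated AR(1) data

Qstar <- function(phi){ yt[1]^2*(1-phi^2) + sum((yt[2:T] - phi*yt[1:(T-1)])^2) }
eta_to_phi <- function(eta){ (1 - exp(eta))/(1 + exp(eta)) }

# log posterior of (eta, v) up to an additive constant, including the Jacobian
log_post <- function(eta, v){
  phi=eta_to_phi(eta)
  0.5*log(1 - phi^2) - (T/2 + 1)*log(v) - Qstar(phi)/(2*v) + eta - 2*log(1 + exp(eta))
}

M=5000; c_tune=0.3                      # c: proposal variance (tuning parameter)
eta=numeric(M); v=numeric(M)
eta[1]=log((1 - 0.5)/(1 + 0.5))         # initialize at phi^(0) = 0.5
v[1]=1/rgamma(1, T/2, rate=Qstar(eta_to_phi(eta[1]))/2)

for (m in 2:M){
  ## step 1: sample v^(m) | phi^(m-1), y ~ IG(T/2, Q*(phi^(m-1))/2)
  v[m]=1/rgamma(1, T/2, rate=Qstar(eta_to_phi(eta[m-1]))/2)
  ## step 2: Metropolis-Hastings step for eta
  eta_star=rnorm(1, eta[m-1], sqrt(c_tune))
  log_r=log_post(eta_star, v[m]) - log_post(eta[m-1], v[m])
  eta[m]=ifelse(log(runif(1)) < log_r, eta_star, eta[m-1])
}

burn=1000
phi_draws=eta_to_phi(eta[-(1:burn)])    # transform back to phi
print(c(mean(phi_draws), mean(v[-(1:burn)])))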