The Moore-Penrose inverse and the Cholesky decomposition are two important tools in computational linear algebra for solving linear systems efficiently. They are widely used in statistics, machine learning, and numerical analysis.
N.1 Properties of transpose
The transpose of a row vector is a column vector, and vice versa. For matrices A, B and a scalar k:
(kA)^\top = kA^\top \qquad \text{scalar multiplication} \tag{N.1}
(A^\top)^\top = A \qquad \text{involution} \tag{N.2}
(A+B)^\top = A^\top + B^\top \qquad\text{distributivity under addition} \tag{N.3}
(AB)^\top = B^\top A^\top \qquad\text{reversal of order} \tag{N.4}
Note that we swap the order of the matrices in the product when taking the transpose.
For any matrix A, both products A^\top A and AA^\top are symmetric and positive semidefinite, since (A^\top A)^\top = A^\top A and (AA^\top)^\top = AA^\top. In general the two products are not equal, and unless A is square they do not even have the same dimensions.
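As a quick numerical check, the transpose identities (N.1)–(N.4) and the symmetry of A^\top A and AA^\top can be verified with NumPy (a sketch using arbitrary random matrices, not data from the text):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 4))
B = rng.standard_normal((4, 2))
C = rng.standard_normal((3, 4))
k = 2.5

assert np.allclose((k * A).T, k * A.T)      # (N.1) scalar multiplication
assert np.allclose(A.T.T, A)                # (N.2) involution
assert np.allclose((A + C).T, A.T + C.T)    # (N.3) distributivity
assert np.allclose((A @ B).T, B.T @ A.T)    # (N.4) the order reverses

# A^T A and A A^T are each symmetric, but they have different shapes here
assert np.allclose(A.T @ A, (A.T @ A).T)    # 4x4 symmetric
assert np.allclose(A @ A.T, (A @ A.T).T)    # 3x3 symmetric
```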
N.2 Full Rank
A matrix is said to be of full row rank if its rows are linearly independent, and it is of full column rank if its columns are linearly independent. A matrix is said to be of full rank if it is either of full row rank or full column rank.
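The rank conditions above are easy to inspect numerically; a small sketch using `np.linalg.matrix_rank` on two hypothetical matrices:

```python
import numpy as np

A = np.array([[1., 0., 2.],
              [0., 1., 3.]])              # 2x3, rows linearly independent
print(np.linalg.matrix_rank(A))           # 2: full row rank

B = np.array([[1., 2.],
              [2., 4.]])                  # second row is twice the first
print(np.linalg.matrix_rank(B))           # 1: rank deficient
```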
N.3 Generalized (Moore-Penrose) Inverse
The Moore-Penrose inversion is a method for computing the pseudoinverse of a matrix. The pseudoinverse is a generalization of the matrix inverse that can be used to solve linear equations even when the matrix is rectangular or singular (non-invertible).
Definition N.1 (Definition of the Moore-Penrose Inverse 1) The Moore–Penrose inverse of the m × n matrix A is the n × m matrix, denoted by A^+, which satisfies the conditions
AA^+A = A \tag{N.5}
A^+AA^+ = A^+ \tag{N.6}
(AA^+)^\top = AA^+ \tag{N.7}
(A^+A)^\top = A^+A \tag{N.8}
An important feature of the Moore–Penrose inverse is that it is uniquely defined.
Corresponding to each m × n matrix A, one and only one n × m matrix A^+ exists satisfying conditions (Equation N.5)–(Equation N.8).
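The four Penrose conditions (N.5)–(N.8) can be verified numerically; a sketch using `np.linalg.pinv` on a deliberately rank-deficient random matrix:

```python
import numpy as np

rng = np.random.default_rng(1)
# 4x5 matrix of rank at most 3: rectangular and singular
A = rng.standard_normal((4, 3)) @ rng.standard_normal((3, 5))
Ap = np.linalg.pinv(A)

assert np.allclose(A @ Ap @ A, A)        # (N.5)
assert np.allclose(Ap @ A @ Ap, Ap)      # (N.6)
assert np.allclose((A @ Ap).T, A @ Ap)   # (N.7)
assert np.allclose((Ap @ A).T, Ap @ A)   # (N.8)
```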
Definition N.1 is the definition of a generalized inverse given by Penrose (1955).
The following alternative definition, which we will find useful on some occasions, utilizes properties of the Moore–Penrose inverse that were first illustrated by Moore (1935).
Definition N.2 (Definition of the Moore-Penrose Inverse 2) Let A be an m × n matrix. Then the Moore–Penrose inverse of A is the unique n × m matrix A^+ satisfying
AA^+ = P_{R(A)} \tag{N.9}
A^+ A = P_{R(A^+)} \tag{N.10}
where P_{R(A)} and P_{R(A^+)} are the projection matrices of the range spaces of A and A^+, respectively.
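Definition N.2 can also be checked numerically: AA^+ is symmetric, idempotent, and acts as the identity on the range of A. A sketch with a random matrix:

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((5, 2))
P = A @ np.linalg.pinv(A)     # P_{R(A)}: projector onto the column space of A

# An orthogonal projector is symmetric and idempotent
assert np.allclose(P, P.T)
assert np.allclose(P @ P, P)
# It leaves any vector already in R(A) unchanged
v = A @ rng.standard_normal(2)
assert np.allclose(P @ v, v)
```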
Theorem N.1 (Properties of the Moore-Penrose inverse) Let A be an m \times n matrix. Then:
- (\alpha A)^+ = \alpha^{-1} A^+ , \text{ if } \alpha \ne 0 \text{ is a scalar}
- (A^\top)^+ = (A^+)^\top
- (A^+)^+ = A
- A^+ = A^{-1} ,\text{if A is square and nonsingular}
- (A^\top A)^+ = A^+ (A^+)^\top and (AA^\top)^+ = (A^+)^\top A^+
- (AA^+)^+ = AA^+ and (A^+ A)^+ = A^+ A
- A^+ = (A^\top A)^+ A^\top = A^\top (AA^\top)^+
- A^+ = (A^\top A)^{-1} A^\top and A^+ A = I_n , \text{ if } rank(A) = n
- A^+ = A^\top (AA^\top)^{-1} and AA^+ = I_m , \text{ if } rank(A) = m
- A^+ = A^\top if the columns of A are orthonormal, that is, A^\top A = I_n
Theorem N.2 (Rank of Moore-Penrose Inverse) For any m \times n matrix A, \text{rank}(A) = \text{rank}(A^+) = \text{rank}(AA^+) = \text{rank}(A^+ A).
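A few of the properties in Theorem N.1, together with the rank equalities of Theorem N.2, can be spot-checked numerically (a sketch with a random rank-2 matrix):

```python
import numpy as np

rng = np.random.default_rng(3)
A = rng.standard_normal((4, 2)) @ rng.standard_normal((2, 3))  # rank 2
Ap = np.linalg.pinv(A)

# (A^T)^+ = (A^+)^T and (A^+)^+ = A  (Theorem N.1)
assert np.allclose(np.linalg.pinv(A.T), Ap.T)
assert np.allclose(np.linalg.pinv(Ap), A)

# rank(A) = rank(A^+) = rank(AA^+) = rank(A^+A)  (Theorem N.2)
r = np.linalg.matrix_rank
assert r(A) == r(Ap) == r(A @ Ap) == r(Ap @ A) == 2
```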
Theorem N.3 (Symmetric Moore-Penrose Inverse) Let A be an m × m symmetric matrix. Then
- A^+ is also symmetric,
- AA^+ = A^+A,
- A^+ = A, if A is idempotent.
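The symmetric case in Theorem N.3 can be illustrated with a symmetric idempotent matrix, for which the pseudoinverse is the matrix itself (a sketch using a hypothetical projection matrix built from random data):

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.standard_normal((5, 2))
P = X @ np.linalg.inv(X.T @ X) @ X.T   # symmetric and idempotent

Pp = np.linalg.pinv(P)
assert np.allclose(Pp, Pp.T)   # A^+ is also symmetric
assert np.allclose(Pp, P)      # A^+ = A when A is symmetric idempotent
```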
The Moore-Penrose inverse is particularly useful in maximum likelihood estimation (MLE) for linear models. In MLE, we often need to solve linear equations of the form Ax = b, where A is the design matrix and b is the response vector. If A is not full rank or is not square, we can use the Moore-Penrose inverse to find a solution that minimizes the residual sum of squares.
In the context of MLE, the Moore-Penrose inverse allows us to obtain parameter estimates even when the design matrix is singular or when there are more predictors than observations. This is achieved by projecting the response vector onto the column space of the design matrix, leading to a solution that is consistent and has desirable statistical properties.
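As an illustration of this use, the sketch below builds a hypothetical rank-deficient design matrix (one predictor duplicated) where (X^\top X)^{-1} does not exist, yet the pseudoinverse still yields the minimum-norm least-squares estimate:

```python
import numpy as np

rng = np.random.default_rng(5)
n = 10
x = rng.standard_normal(n)
X = np.column_stack([x, 2 * x])    # duplicated predictor: X^T X is singular
beta_true = np.array([1.0, 0.5])
y = X @ beta_true + 0.01 * rng.standard_normal(n)

# (X^T X)^{-1} fails, but the pseudoinverse gives the minimum-norm solution
beta_hat = np.linalg.pinv(X) @ y

# np.linalg.lstsq computes the same minimum-norm least-squares estimate
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
assert np.allclose(beta_hat, beta_lstsq)
```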
We start with the linear model:
y = \mathbf{X} \mathbf{\beta} + \mathbf{\varepsilon} \qquad \mathbf{\varepsilon} \sim \mathcal{N} (0, v\mathbf{I})
We want the MLE of \mathbf{\beta}, which is given by: \hat{\mathbf{\beta}}_{\text{MLE}} = (\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top y
In the Bayesian setting, under a uniform prior on \mathbf{\beta}, maximizing the posterior is equivalent to maximizing the likelihood, i.e. minimizing the negative log-likelihood; with Gaussian noise this reduces to minimizing the residual sum of squares.
We can show that if we use least squares, i.e. l_2-norm minimization, the solution is given by the Moore–Penrose inverse. We can write the objective explicitly as:
\mathbb{E}_{l_2}(\mathbf{\beta}) = \frac{1}{2} \sum_{i=1}^{n} (y_i - \mathbf{x}_i^\top \mathbf{\beta})^2 = \frac{1}{2} \lVert y - \mathbf{X}\mathbf{\beta} \rVert_2^2
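To check that the pseudoinverse solution indeed minimizes this objective, the sketch below (with a hypothetical full-rank design) compares the objective at \hat{\beta} = X^+ y against randomly perturbed coefficient vectors:

```python
import numpy as np

rng = np.random.default_rng(6)
X = rng.standard_normal((20, 3))
beta_true = np.array([2.0, -1.0, 0.5])
y = X @ beta_true + 0.1 * rng.standard_normal(20)

def rss(b):
    """The l2 objective: 0.5 * ||y - X b||^2."""
    return 0.5 * np.sum((y - X @ b) ** 2)

beta_hat = np.linalg.pinv(X) @ y

# No nearby perturbation of beta_hat achieves a lower objective value
for _ in range(5):
    d = 0.01 * rng.standard_normal(3)
    assert rss(beta_hat + d) >= rss(beta_hat)
```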