Sherman-Morrison-Woodbury | Effect of low-rank changes to a matrix inverse

October 1, 2020 by Dibya Prakash Das


I recently came across this tweet about the Sherman-Morrison-Woodbury formula (henceforth referred to as SMW in this post). I was reading about linear regression at the time and realized that this formula has a very practical application there. I will state the formula and briefly explain one of its applications. The Sherman-Morrison-Woodbury formula is:

$$(A + uv^T)^{-1} = A^{-1} - \frac{A^{-1}uv^TA^{-1}}{1+v^TA^{-1}u}$$ where $A$ is an invertible $n \times n$ matrix, $u$ and $v$ are both $n \times 1$ column vectors, and $1 + v^TA^{-1}u \neq 0$. Looking at this briefly, one can see that the formula gives a way of calculating the inverse of a matrix that has been perturbed by $uv^T$ in terms of $A^{-1}$. In fact, it's easier to understand the formula if we just replace $A$ with $I$ and interpret what it does.

$$(I + uv^T)^{-1} = I^{-1} - \frac{I^{-1}uv^TI^{-1}}{1+v^TI^{-1}u} = I - \frac{uv^T}{1+v^Tu}$$ We first note that $uv^T$ is a rank-1 $n \times n$ matrix.

It's a good exercise to work this out and convince yourself that it really is a rank-1 matrix despite being $n \times n$ dimensional. After all, the rank of a matrix is defined as the number of linearly independent columns it has, so its rank is not determined by its dimensions: every column of $uv^T$ is a scalar multiple of $u$, hence the rank is 1.
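As a quick sanity check, here is a minimal NumPy sketch (synthetic vectors, my own variable names) confirming that the outer product $uv^T$ is $n \times n$ yet has rank 1:

```python
# A minimal sketch, assuming NumPy; u and v are random column vectors.
import numpy as np

rng = np.random.default_rng(0)
n = 5
u = rng.standard_normal((n, 1))
v = rng.standard_normal((n, 1))

P = u @ v.T                      # n x n outer product
print(P.shape)                   # (5, 5)
print(np.linalg.matrix_rank(P))  # 1 -- every column of P is a multiple of u
```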

SMW gives us a way to calculate the inverse of a matrix that has been perturbed by a rank-1 matrix in terms of a rank-1 change to the inverse of the original matrix. If $A^{-1}$ is already known, it therefore provides a cheap, $O(n^2)$ way of computing the inverse of $A + uv^T$, which would otherwise cost $O(n^3)$ if we computed it from scratch.
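Here is a rough NumPy sketch (random, well-conditioned $A$; the names are mine) verifying the rank-1 SMW identity against a from-scratch inversion:

```python
# A sketch, assuming NumPy; A is shifted by n*I to keep it well-conditioned.
import numpy as np

rng = np.random.default_rng(1)
n = 6
A = rng.standard_normal((n, n)) + n * np.eye(n)
u = rng.standard_normal((n, 1))
v = rng.standard_normal((n, 1))

A_inv = np.linalg.inv(A)

# RHS of SMW: only matrix-vector products and one outer product, O(n^2) given A_inv.
smw = A_inv - (A_inv @ u @ v.T @ A_inv) / (1.0 + v.T @ A_inv @ u)

# LHS: invert the perturbed matrix from scratch, O(n^3).
direct = np.linalg.inv(A + u @ v.T)

print(np.allclose(smw, direct))  # True
```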

One of its widely used applications is in least-squares/linear regression problems. I will highlight how SMW is used in that case. For the linear regression problem, we have $Y = Xw$, where $X$ is an $N \times D$ matrix whose $N$ rows are the transposes of the $D \times 1$ input vectors, $Y$ is the $N \times 1$ vector of target values, $w$ is the $D \times 1$ weight vector, and $N \geq D$. From maximum likelihood estimation, we know that the value of $w$ that minimizes the sum-of-squares error $\|Y - Xw\|^2$ is given by the normal equation, $$w^* = (X^TX)^{-1}X^TY$$ Here $X^TX$ is a $D \times D$ matrix, which you can think of as the (unnormalized) covariance matrix. I have taken the basis function $\phi$ to be the identity function for brevity.

So assume that we have calculated the best fit $w^{*}$ by this formula for the $N$ points that we have. What if we collect a new input data point? This is often the case in real-world scenarios (e.g. GPS locations streaming in from a sensor). Now we have $X^\prime$, which is an $(N+1) \times D$ matrix, and to recalculate $w^{*}$ we have to invert $(X^\prime)^TX^\prime$. Matrix inversion is an expensive operation, on the order of $O(D^3)$. But we already have $(X^TX)^{-1}$: can we somehow use it to get $((X^\prime)^TX^\prime)^{-1}$? Turns out we can! This is where we can use SMW to compute $((X^\prime)^TX^\prime)^{-1}$ in terms of $(X^TX)^{-1}$.

But first, let us see what the perturbation to $X^TX$ is. We first show that $X^TX$ is a sum of $N$ rank-1 matrices. The $i$th row of $X$ is the $1 \times D$ vector $x_i^T$, and the $i$th column of $X^T$ is nothing but the $D \times 1$ vector $x_i$. From matrix multiplication it follows that $X^TX = \sum\limits_{i=1}^N x_ix_i^T$, and each $x_ix_i^T$ is a rank-1 matrix.
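To make the setup concrete, here is a small NumPy sketch with synthetic data (names like `w_star` are mine) that solves the normal equation and checks that $X^TX$ is indeed the sum of the $N$ outer products $x_ix_i^T$:

```python
# A minimal sketch, assuming NumPy and synthetic data.
import numpy as np

rng = np.random.default_rng(0)
N, D = 100, 4
X = rng.standard_normal((N, D))              # N rows, each an input x_i^T
Y = rng.standard_normal((N, 1))              # N target values

# Normal equation: w* = (X^T X)^{-1} X^T Y
w_star = np.linalg.solve(X.T @ X, X.T @ Y)

# X^T X as a sum of N rank-1 outer products x_i x_i^T
outer_sum = sum(np.outer(X[i], X[i]) for i in range(N))
print(np.allclose(X.T @ X, outer_sum))       # True
```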
Similarly, $(X^\prime)^TX^\prime = \sum\limits_{i=1}^{N+1} x_ix_i^T = \sum\limits_{i=1}^{N} x_ix_i^T + x_{N+1}x_{N+1}^T = X^TX + x_{N+1}x_{N+1}^T$. So we have shown that $(X^\prime)^TX^\prime$ is equal to $X^TX$ plus a rank-1 perturbation. Keeping this in mind, we can now apply the SMW formula to get the inverse of $(X^\prime)^TX^\prime$:

$$((X^\prime)^TX^\prime)^{-1} = (X^TX)^{-1} - \frac{(X^TX)^{-1}x_{N+1}x_{N+1}^T(X^TX)^{-1}}{1+x_{N+1}^T(X^TX)^{-1}x_{N+1}}$$ The second term on the RHS can itself be shown to have rank 1.
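Here is a rough NumPy sketch of that update (synthetic data; `C_inv` stands for $(X^TX)^{-1}$ and `x_new` for $x_{N+1}$, both names mine), compared against recomputing everything from scratch:

```python
# A sketch, assuming NumPy; C_inv = (X^T X)^{-1}, x_new = x_{N+1}.
import numpy as np

rng = np.random.default_rng(1)
N, D = 100, 4
X = rng.standard_normal((N, D))
Y = rng.standard_normal((N, 1))
C_inv = np.linalg.inv(X.T @ X)               # already computed for the first N points

# A new data point (x_{N+1}, y_{N+1}) arrives.
x_new = rng.standard_normal((D, 1))
y_new = rng.standard_normal((1, 1))

# Rank-1 SMW update of the inverse: O(D^2) work.
Cx = C_inv @ x_new                           # C_inv is symmetric, so Cx.T = x_new^T C_inv
C_inv_new = C_inv - (Cx @ Cx.T) / (1.0 + x_new.T @ Cx)

# New weights from the updated inverse.
X_prime = np.vstack([X, x_new.T])
Y_prime = np.vstack([Y, y_new])
w_new = C_inv_new @ X_prime.T @ Y_prime

# Sanity check against re-solving the normal equation from scratch (O(D^3)).
w_direct = np.linalg.solve(X_prime.T @ X_prime, X_prime.T @ Y_prime)
print(np.allclose(C_inv_new, np.linalg.inv(X_prime.T @ X_prime)))  # True
print(np.allclose(w_new, w_direct))                                # True
```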

And with this, we can now proceed to calculate the new $w_{new}^{*}$ using the above expression for $((X^\prime)^TX^\prime)^{-1}$. The SMW formula gives us a much faster way to update our weight vector $w^*$ as and when new points arrive. In fact, a generalized version of the formula lets us compute the inverse of a matrix perturbed by a rank-$k$ matrix in terms of a rank-$k$ change to the original matrix's inverse.

$$ (A+UV)^{-1} = A^{-1} - A^{-1}U(I_k+VA^{-1}U)^{-1}VA^{-1} $$

So instead of the above scenario, where we had only one new data point, imagine we get new data points in batches of $M$. We can then form the perturbation $UV$, where $U$ is a $D \times M$ matrix each of whose columns is a $D$-dimensional data point, and $V = U^T$. With that setup, we can compute $((X^\prime)^TX^\prime)^{-1}$ for the $M$ new data points using the above generalized SMW formula (provided that $I_k+VA^{-1}U$ is invertible in the first place).
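A sketch of such a batch update in NumPy (synthetic data again; `X_batch`, `U`, `V` are my own names), checked against a full recomputation:

```python
# A sketch, assuming NumPy; U = X_batch^T (D x M), V = X_batch (M x D).
import numpy as np

rng = np.random.default_rng(2)
N, D, M = 200, 20, 3
X = rng.standard_normal((N, D))
X_batch = rng.standard_normal((M, D))        # M new data points, one per row

A_inv = np.linalg.inv(X.T @ X)               # existing (X^T X)^{-1}
U = X_batch.T                                # D x M
V = X_batch                                  # M x D, i.e. V = U^T

# Woodbury: only the small M x M matrix (I_k + V A^{-1} U) is inverted.
small = np.linalg.inv(np.eye(M) + V @ A_inv @ U)
A_inv_new = A_inv - A_inv @ U @ small @ V @ A_inv

# Sanity check against inverting the full (N+M)-row Gram matrix from scratch.
X_prime = np.vstack([X, X_batch])
print(np.allclose(A_inv_new, np.linalg.inv(X_prime.T @ X_prime)))  # True
```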

Sherman-Morrison-Woodbury has many other applications, one of them being the widely used Kalman filter. There is also the matrix determinant lemma, which lets us calculate the determinant of $A + uv^T$ in terms of $A^{-1}$ and $\det(A)$. Used in conjunction with SMW, both the inverse and the determinant of a matrix perturbed by rank-1 changes can be computed efficiently.
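For reference, the rank-1 form of that lemma (valid when $A$ is invertible) is

$$\det(A + uv^T) = (1 + v^TA^{-1}u)\,\det(A)$$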

What a wonderful mathematical tool! I am glad I came across that tweet and got to know about this.
