<?xml version="1.0" encoding="utf-8" standalone="yes" ?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
  <channel>
    <title>Least squares | Nicholas Hu</title>
    <link>https://www.math.ucla.edu/~njhu/notes/nla/lsq/</link>
      <atom:link href="https://www.math.ucla.edu/~njhu/notes/nla/lsq/index.xml" rel="self" type="application/rss+xml" />
    <description>Least squares</description>
    <generator>Hugo Blox Builder (https://hugoblox.com)</generator><language>en-ca</language><lastBuildDate>Mon, 16 Jun 2025 00:00:00 +0000</lastBuildDate>
    <image>
      <url>https://www.math.ucla.edu/~njhu/media/icon_hu_d46824b1c45312fd.png</url>
      <title>Least squares</title>
      <link>https://www.math.ucla.edu/~njhu/notes/nla/lsq/</link>
    </image>
    
    <item>
      <title>Projections and least squares problems</title>
      <link>https://www.math.ucla.edu/~njhu/notes/nla/lsq/leastsquares/</link>
      <pubDate>Sat, 22 Feb 2025 00:00:00 +0000</pubDate>
      <guid>https://www.math.ucla.edu/~njhu/notes/nla/lsq/leastsquares/</guid>
      <description>&lt;div class=&#34;btn-links mb-3&#34;&gt;
&lt;a class=&#34;btn btn-outline-primary btn-page-header btn-sm&#34; href=&#34;../leastsquares.pdf&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;
  PDF
&lt;/a&gt;
&lt;/div&gt;
&lt;!--
No newlines allowed between $$&#39;s below!
--&gt;
&lt;div style=&#34;display: none;&#34;&gt;
$$
\newcommand{\set}[1]{\{ #1 \}}
\newcommand{\Set}[1]{\left \{ #1 \right\}}
\renewcommand{\emptyset}{\varnothing}
\newcommand{\N}{\mathbb{N}}
\newcommand{\Z}{\mathbb{Z}}
\newcommand{\R}{\mathbb{R}}
\newcommand{\Rn}{\mathbb{R}^n}
\newcommand{\Rm}{\mathbb{R}^m}
\newcommand{\C}{\mathbb{C}}
\newcommand{\F}{\mathbb{F}}
\newcommand{\abs}[1]{\lvert #1 \rvert}
\newcommand{\Abs}[1]{\left\lvert #1 \right\rvert}
\newcommand{\inner}[2]{\langle #1, #2 \rangle}
\newcommand{\Inner}[2]{\left\langle #1, #2 \right\rangle}
\newcommand{\norm}[1]{\lVert #1 \rVert}
\newcommand{\Norm}[1]{\left\lVert #1 \right\rVert}
\newcommand{\tp}{{\top}}
\newcommand{\trans}{{\top}}
\newcommand{\span}{\operatorname{span}}
\newcommand{\im}{\operatorname{im}}
\newcommand{\ker}{\operatorname{ker}}
\newcommand{\rank}{\operatorname{rank}}
\newcommand{\proj}[1]{\mathop{\mathrm{proj}_{#1}}}
\newcommand{\K}{\mathcal{K}}
\newcommand{\L}{\mathcal{L}}
\renewcommand{\epsilon}{\varepsilon}
\definecolor{cblue}{RGB}{31, 119, 180}
\definecolor{corange}{RGB}{255, 127, 14}
\definecolor{cgreen}{RGB}{44, 160, 44}
\definecolor{cred}{RGB}{214, 39, 40}
\definecolor{cpurple}{RGB}{148, 103, 189}
\definecolor{cbrown}{RGB}{140, 86, 75}
\definecolor{cpink}{RGB}{227, 119, 194}
\definecolor{cgrey}{RGB}{127, 127, 127}
\definecolor{cyellow}{RGB}{188, 189, 34}
\definecolor{cteal}{RGB}{23, 190, 207}
$$
&lt;/div&gt;
&lt;h2 id=&#34;projections&#34;&gt;Projections&lt;/h2&gt;
&lt;p&gt;Let 

$H$ be a Hilbert space and 

$Y \subseteq H$. The &lt;strong&gt;(orthogonal) projection operator&lt;/strong&gt; onto 

$Y$ is defined for 

$x \in H$ by


$$
\proj{Y}(x) := \underset{y \in Y}{\operatorname{argmin}} \frac{1}{2} \norm{y - x}^2.
$$&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Hilbert projection theorem (first projection theorem)&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;If 

$Y$ is nonempty, closed, and convex, then 

$\proj{Y}(x)$ is a singleton (so 

$\proj{Y} : H \to Y$ is well-defined).&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;&lt;em&gt;Proof.&lt;/em&gt; Let 

$(y_n)_{n=1}^\infty \subseteq Y$ be such that 

$d_n := \frac{1}{2} \norm{y_n - x}^2 \to d := \inf_{y \in Y} \frac{1}{2} \norm{y-x}^2$. By the parallelogram identity,


$$
\Norm{\frac{y_m + y_n}{2} - x}^2 + \Norm{\frac{y_m - y_n}{2}}^2 = 2\Norm{\frac{y_m - x}{2}}^2 + 2\Norm{\frac{y_n - x}{2}}^2 = d_m + d_n,
$$
where 

$\norm{\frac{y_m + y_n}{2} - x}^2 \geq 2d$ by convexity. Taking 

$m, n \to \infty$ shows that 

$(y_n)$ is Cauchy and therefore convergent to some 

$y \in Y$ with 

$\frac{1}{2} \norm{y - x}^2 = d$. Moreover, if 

$y&#39; \in Y$ is another minimizer, replacing 

$y_m, y_n$ by 

$y, y&#39;$ above shows that 

$y = y&#39;$. ∎&lt;/p&gt;
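&lt;p&gt;As a concrete finite-dimensional sketch (using NumPy; the set and data below are illustrative): the projection onto the nonnegative orthant, a closed convex set, is coordinatewise clipping, and the variational inequality of the characterization below can be checked directly.&lt;/p&gt;

```python
import numpy as np

# Projection onto the closed convex set Y = nonnegative orthant of R^n:
# minimizing (1/2)||y - x||^2 over y with nonnegative entries decouples
# by coordinate, giving proj_Y(x) = max(x, 0).
x = np.array([1.5, -2.0, 0.0, -0.5, 3.0])
y = np.maximum(x, 0.0)

# Variational characterization: Re(inner(x - y, z - y)) is nonpositive
# for every z in Y (real inner product here; z sampled from Y at random).
rng = np.random.default_rng(0)
Z = rng.uniform(0.0, 10.0, size=(100, 5))   # random points of Y
assert np.all(0 >= (Z - y) @ (x - y) - 1e-12)
```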
&lt;p&gt;Recall that the &lt;strong&gt;polar cone&lt;/strong&gt; of 

$Y$ is 

$Y^\circ := \set{x \in H : \forall y \in Y \, (\Re(\inner{x}{y}) \leq 0)}$ and that the &lt;strong&gt;orthogonal complement&lt;/strong&gt; of 

$Y$ is 

$Y^\perp := \set{x \in H : \forall y \in Y \, (\inner{x}{y} = 0)}$; clearly, if 

$Y$ is a &lt;em&gt;subspace&lt;/em&gt; of 

$H$, then 

$Y^\circ = Y^\perp$.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Characterization of projections (second projection theorem)&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;If 

$Y$ is nonempty, closed, and convex, then 

$y = \proj{Y}(x)$ if and only if 

$y \in Y$ and 

$x-y \in (Y-y)^\circ$.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;&lt;em&gt;Proof.&lt;/em&gt; If 

$y = \proj{Y}(x)$ and 

$y&#39; \in Y$, then for all 

$\lambda \in [0, 1]$, we have


$$
\norm{y-x}^2 \leq \norm{(1-\lambda)y + \lambda y&#39; - x}^2 = \norm{y-x}^2 + 2\lambda \Re(\inner{y-x}{y&#39;-y}) + \lambda^2 \norm{y&#39; - y}^2,
$$
so 

$\Re(\inner{y-x}{y&#39;-y}) \geq 0$. Conversely, if 

$y, y&#39; \in Y$ and 

$x-y \in (Y-y)^\circ$, then setting 

$\lambda = 1$ in the inequality above shows that 

$y = \proj{Y}(x)$. ∎&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Firm nonexpansiveness of the projection operator&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;If 

$Y$ is nonempty, closed, and convex, then
$$
\norm{\proj{Y}(x) - \proj{Y}(x&#39;)}^2 + \norm{(I-\proj{Y})(x) - (I-\proj{Y})(x&#39;)}^2 \leq \norm{x-x&#39;}^2.
$$&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;&lt;em&gt;Proof.&lt;/em&gt; Let 

$y = \proj{Y}(x)$ and 

$y&#39; = \proj{Y}(x&#39;)$, and add the inequalities 

$\Re(\inner{x&#39;-y&#39;}{y-y&#39;}) \leq 0$ and 

$\Re(\inner{x-y}{y&#39;-y}) \leq 0$ given by the second projection theorem; expanding 

$\norm{x-x&#39;}^2 = \norm{(y-y&#39;) + ((x-y) - (x&#39;-y&#39;))}^2$ then yields the result. ∎&lt;/p&gt;
&lt;p&gt;In particular, this implies that the projection operator is nonexpansive: 

$\norm{\proj{Y}(x) - \proj{Y}(x&#39;)} \leq \norm{x-x&#39;}$.&lt;/p&gt;
&lt;p&gt;If 

$Y$ is a &lt;em&gt;closed subspace&lt;/em&gt; of 

$H$, it follows from the above that 

$y = \proj{Y}(x)$ if and only if 

$y \in Y$ and 

$x-y \in Y^\perp$, and that 

$\proj{Y} : H \to Y$ is a &lt;em&gt;linear&lt;/em&gt; operator with 

$\norm{\proj{Y}} \leq 1$, 

$\im(\proj{Y}) = Y$, and 

$\ker(\proj{Y}) = Y^\perp$. In addition, 

$\proj{Y^\perp} = I-\proj{Y}$.&lt;/p&gt;
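&lt;p&gt;For a finite-dimensional subspace $Y = \im(Q)$ with $Q$ having orthonormal columns, the projection operator is the matrix $QQ^*$. A brief NumPy sketch of the properties above (the matrices are illustrative test data):&lt;/p&gt;

```python
import numpy as np

# For a subspace Y spanned by the orthonormal columns of Q, the
# orthogonal projection onto Y is the matrix P = Q Q*.
rng = np.random.default_rng(0)
A = rng.standard_normal((5, 2))
Q, _ = np.linalg.qr(A)            # orthonormal basis for im(A)
P = Q @ Q.conj().T                # projection onto Y = im(A)

x = rng.standard_normal(5)
y = P @ x

assert np.allclose(P @ P, P)                 # idempotent
assert np.allclose(Q.conj().T @ (x - y), 0)  # x - y lies in the orthogonal complement
assert np.allclose((np.eye(5) - P) @ (np.eye(5) - P), np.eye(5) - P)  # I - P projects onto Y-perp
```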
&lt;h2 id=&#34;least-squares-problems&#34;&gt;Least squares problems&lt;/h2&gt;
&lt;p&gt;Let 

$H_1$ and 

$H_2$ be Hilbert spaces and suppose that 

$A : H_1 \to H_2$ is a continuous linear operator with closed image.&lt;sup id=&#34;fnref:1&#34;&gt;&lt;a href=&#34;#fn:1&#34; class=&#34;footnote-ref&#34; role=&#34;doc-noteref&#34;&gt;1&lt;/a&gt;&lt;/sup&gt; The &lt;strong&gt;(linear) least squares problem&lt;/strong&gt; is that of finding an 

$x \in H_1$ that minimizes 

$\frac{1}{2} \norm{b - Ax}^2$ for a given 

$b \in H_2$, or equivalently, that satisfies 

$Ax = \proj{\im(A)} b$. Using the fact that 

$\im(A)^\perp = \ker(A^*)$, we can also write this as the &lt;strong&gt;normal equation&lt;/strong&gt; 

$A^*Ax = A^*b$.&lt;/p&gt;
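&lt;p&gt;A minimal NumPy sketch of the normal equation in the full-column-rank case (illustrative data; in practice one prefers QR- or SVD-based solvers, since forming $A^*A$ squares the condition number):&lt;/p&gt;

```python
import numpy as np

# Least squares via the normal equation A*Ax = A*b (full column rank).
rng = np.random.default_rng(1)
A = rng.standard_normal((6, 3))
b = rng.standard_normal(6)

x = np.linalg.solve(A.conj().T @ A, A.conj().T @ b)

# x agrees with the library least squares solver, and the residual
# b - Ax is orthogonal to im(A), i.e. lies in ker(A*).
x_ref = np.linalg.lstsq(A, b, rcond=None)[0]
assert np.allclose(x, x_ref)
assert np.allclose(A.conj().T @ (b - A @ x), 0)
```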
&lt;h3 id=&#34;the-pseudoinverse&#34;&gt;The pseudoinverse&lt;/h3&gt;
&lt;p&gt;To solve the least squares problem, we observe that 

$A\restriction_{\ker(A)^\perp} : \ker(A)^\perp \to \im(A)$ is bijective since 

$Ax = Ax&#39;$ implies that 

$x-x&#39; \in \ker(A)$ and 

$y = Ax$ implies that 

$y = A (x - \proj{\ker(A)} x)$. Thus, the &lt;strong&gt;pseudoinverse&lt;/strong&gt; 

$A^+ : H_2 \to H_1$ of 

$A$, defined as


$$
A^+ := A\restriction_{\ker(A)^\perp}^{-1} \circ \proj{\im(A)},
$$
is a well-defined continuous linear operator, and by construction 

$x^* := A^+ b$ is a solution to the least squares problem.&lt;/p&gt;
&lt;p&gt;This solution need not be unique; however, it is the unique solution of &lt;em&gt;minimal norm&lt;/em&gt; because 

$x - x^* \in \ker(A)$ for any solution 

$x$, so 

$\norm{x}^2 = \norm{x-x^*}^2 + \norm{x^*}^2 \geq \norm{x^*}^2$ with equality if and only if 

$x = x^*$.&lt;/p&gt;
&lt;p&gt;It is straightforward to verify that:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;

$A^+ = A^{-1}$ if 

$A$ is bijective&lt;/li&gt;
&lt;li&gt;

$\im(A^+) = \ker(A)^\perp$, 

$\ker(A^+) = \im(A)^\perp$&lt;/li&gt;
&lt;li&gt;

$AA^+ = \proj{\im(A)}$, 

$A^+A = \proj{\im(A^+)}$ (and in fact, these characterize the pseudoinverse)&lt;/li&gt;
&lt;li&gt;

$(A^+)^+ = A$&lt;/li&gt;
&lt;li&gt;

$(A^*)^+ = (A^+)^*$&lt;/li&gt;
&lt;li&gt;

$A^+ = (A^* A)^+ A^* = A^* (AA^*)^+$&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;In the finite-dimensional case, if 

$A \in \C^{m \times n}$ has full column rank, then 

$A^+ = (A^* A)^{-1} A^*$ by the identities above; similarly, if it has full row rank, then 

$A^+ = A^* (AA^*)^{-1}$. More generally, if 

$\hat{U} \hat{\Sigma} \hat{V}^*$ is a compact SVD of 

$A$ (that is, 

$\hat{\Sigma}$ is 

$r \times r$, where 

$r = \rank(A)$), then 

$A^+ = \hat{V} \hat{\Sigma}^{-1} \hat{U}^*$.&lt;/p&gt;
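&lt;p&gt;A NumPy sketch of the pseudoinverse via a compact SVD (the rank-deficient test matrix is illustrative):&lt;/p&gt;

```python
import numpy as np

# Pseudoinverse via a compact SVD: A+ = V Sigma^{-1} U*, keeping only the
# r = rank(A) nonzero singular values. Here A is 4x3 with rank 2.
rng = np.random.default_rng(2)
B = rng.standard_normal((4, 2))
C = rng.standard_normal((2, 3))
A = B @ C                        # rank 2 by construction

U, s, Vh = np.linalg.svd(A, full_matrices=False)
r = np.sum(s > 1e-12 * s[0])     # numerical rank
A_pinv = Vh[:r].conj().T @ np.diag(1.0 / s[:r]) @ U[:, :r].conj().T

assert np.allclose(A_pinv, np.linalg.pinv(A))
# AA+ and A+A are the projections onto im(A) and im(A+) = ker(A)-perp:
assert np.allclose(A @ A_pinv @ A, A)
assert np.allclose(A_pinv @ A @ A_pinv, A_pinv)
```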
&lt;div class=&#34;footnotes&#34; role=&#34;doc-endnotes&#34;&gt;
&lt;hr&gt;
&lt;ol&gt;
&lt;li id=&#34;fn:1&#34;&gt;
&lt;p&gt;Note that this implies that 

$A^*$ also has closed image, so 

$\im(A)^\perp = \ker(A^*)$ and 

$\ker(A)^\perp = \overline{\im(A^*)} = \im(A^*)$.&amp;#160;&lt;a href=&#34;#fnref:1&#34; class=&#34;footnote-backref&#34; role=&#34;doc-backlink&#34;&gt;&amp;#x21a9;&amp;#xfe0e;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;/div&gt;
</description>
    </item>
    
    <item>
      <title>The QR factorization</title>
      <link>https://www.math.ucla.edu/~njhu/notes/nla/lsq/qr/</link>
      <pubDate>Tue, 04 Mar 2025 00:00:00 +0000</pubDate>
      <guid>https://www.math.ucla.edu/~njhu/notes/nla/lsq/qr/</guid>
      <description>&lt;div class=&#34;btn-links mb-3&#34;&gt;
&lt;a class=&#34;btn btn-outline-primary btn-page-header btn-sm&#34; href=&#34;../qr.pdf&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;
  PDF
&lt;/a&gt;
&lt;/div&gt;
&lt;!--
No newlines allowed between $$&#39;s below!
--&gt;
&lt;div style=&#34;display: none;&#34;&gt;
$$
\newcommand{\set}[1]{\{ #1 \}}
\newcommand{\Set}[1]{\left \{ #1 \right\}}
\renewcommand{\emptyset}{\varnothing}
\newcommand{\N}{\mathbb{N}}
\newcommand{\Z}{\mathbb{Z}}
\newcommand{\R}{\mathbb{R}}
\newcommand{\Rn}{\mathbb{R}^n}
\newcommand{\Rm}{\mathbb{R}^m}
\newcommand{\C}{\mathbb{C}}
\newcommand{\F}{\mathbb{F}}
\newcommand{\abs}[1]{\lvert #1 \rvert}
\newcommand{\Abs}[1]{\left\lvert #1 \right\rvert}
\newcommand{\inner}[2]{\langle #1, #2 \rangle}
\newcommand{\Inner}[2]{\left\langle #1, #2 \right\rangle}
\newcommand{\norm}[1]{\lVert #1 \rVert}
\newcommand{\Norm}[1]{\left\lVert #1 \right\rVert}
\newcommand{\tp}{{\top}}
\newcommand{\trans}{{\top}}
\newcommand{\span}{\operatorname{span}}
\newcommand{\im}{\operatorname{im}}
\newcommand{\ker}{\operatorname{ker}}
\newcommand{\rank}{\operatorname{rank}}
\newcommand{\proj}[1]{\mathop{\mathrm{proj}_{#1}}}
\newcommand{\refl}[1]{\mathop{\mathrm{refl}_{#1}}}
\newcommand{\K}{\mathcal{K}}
\newcommand{\L}{\mathcal{L}}
\renewcommand{\epsilon}{\varepsilon}
\newcommand{\conj}{\overline}
\newcommand{\sign}{\operatorname{sign}}
\definecolor{cblue}{RGB}{31, 119, 180}
\definecolor{corange}{RGB}{255, 127, 14}
\definecolor{cgreen}{RGB}{44, 160, 44}
\definecolor{cred}{RGB}{214, 39, 40}
\definecolor{cpurple}{RGB}{148, 103, 189}
\definecolor{cbrown}{RGB}{140, 86, 75}
\definecolor{cpink}{RGB}{227, 119, 194}
\definecolor{cgrey}{RGB}{127, 127, 127}
\definecolor{cyellow}{RGB}{188, 189, 34}
\definecolor{cteal}{RGB}{23, 190, 207}
$$
&lt;/div&gt;
&lt;p&gt;Let 

$A \in \C^{m \times n}$. The &lt;strong&gt;QR factorization&lt;/strong&gt; is a factorization of 

$A$ as 

$QR$, where 

$Q \in \C^{m \times m}$ is unitary and 

$R \in \C^{m \times n}$ is (rectangular) upper triangular.&lt;sup id=&#34;fnref:1&#34;&gt;&lt;a href=&#34;#fn:1&#34; class=&#34;footnote-ref&#34; role=&#34;doc-noteref&#34;&gt;1&lt;/a&gt;&lt;/sup&gt; We will show below that such a factorization always exists by describing three different methods to compute it.&lt;/p&gt;
&lt;p&gt;When 

$A$ has full column rank, we have 

$a_j = \sum_{i \leq j} r_{ij} q_i$ for each 

$j$, so 

$\span \, \set{a_j}_{j \leq k} \subseteq \span \, \set{q_j}_{j \leq k}$ for each 

$k$. As these subspaces are both 

$k$-dimensional, they must be equal, which also implies that the diagonal entries of 

$R$ are nonzero. Moreover, if 

$\hat{Q}$ denotes the left 

$m \times n$ submatrix of 

$Q$ and 

$\hat{R}$ denotes the upper 

$n \times n$ submatrix of 

$R$, we have the &lt;strong&gt;thin/reduced QR factorization&lt;/strong&gt; 

$A = \hat{Q} \hat{R}$.&lt;/p&gt;
&lt;p&gt;The thin QR factorization of a full column rank matrix is nearly unique in the sense that if 

$A = \tilde{Q} \tilde{R}$ for some 

$\tilde{Q} \in \C^{m \times n}$ with orthonormal columns and some upper triangular 

$\tilde{R} \in \C^{n \times n}$, then 

$\tilde{Q} = \hat{Q}D$ and 

$\hat{R} = D\tilde{R}$ for some diagonal matrix 

$D$ whose diagonal entries have unit modulus. This follows from the observation that 

$D := \hat{Q}^* \tilde{Q} = \hat{R} \tilde{R}^{-1} = \hat{R}^{-*} \tilde{R}^*$ must be both upper and lower triangular. Thus, if we specify a (complex) sign for each diagonal entry of 

$\hat{R}$, the factorization is unique.&lt;/p&gt;
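&lt;p&gt;A small NumPy check of this near-uniqueness (the sign flips below are an arbitrary choice of unit-modulus factors):&lt;/p&gt;

```python
import numpy as np

# Two thin QR factorizations of the same full-column-rank A differ by a
# diagonal matrix D of unit-modulus factors. We build a second
# factorization by flipping signs by hand.
rng = np.random.default_rng(3)
A = rng.standard_normal((5, 3))
Q_hat, R_hat = np.linalg.qr(A)   # reduced (thin) QR

signs = np.diag([1.0, -1.0, -1.0])
Q_tilde = Q_hat @ signs
R_tilde = signs @ R_hat          # A = Q_tilde R_tilde as well

D = Q_hat.conj().T @ Q_tilde
assert np.allclose(A, Q_tilde @ R_tilde)
assert np.allclose(D, signs)                 # D is diagonal...
assert np.allclose(np.abs(np.diag(D)), 1.0)  # ...with unit-modulus entries
```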
&lt;h2 id=&#34;gramschmidt-orthogonalization&#34;&gt;Gram–Schmidt orthogonalization&lt;/h2&gt;
&lt;p&gt;Suppose that 

$(a_j)_{j \geq 1}$ is a sequence of vectors in a Hilbert space 

$V$. &lt;strong&gt;Gram–Schmidt orthogonalization&lt;/strong&gt; defines an &lt;em&gt;orthogonal&lt;/em&gt; sequence of vectors 

$(b_j)_{j \geq 1}$ in 

$V$ such that 

$\mathcal{A}_k := \span \, \set{a_j}_{j \leq k} = \mathcal{B}_k := \span \, \set{b_j}_{j \leq k}$ for each 

$k$. To wit, let 

$\proj{b} := \proj{\span{\set{b}}}$ for 

$b \in V$; that is,


$$
\proj{b} a =
\begin{cases}
\frac{\inner{a}{b}}{\inner{b}{b}} b &amp; \text{if 

$b \neq 0$}, \\
0 &amp; \text{if 

$b = 0$}.
\end{cases}
$$
We then inductively define


$$
b_j := a_j - \sum_{i &lt; j} \proj{b_i} a_j.
$$
Assuming that 

$\set{b_j}_{j &lt; k}$ is orthogonal for a given 

$k$, we then have 

$\inner{b_k}{b_j} = \inner{a_k - \sum_{i &lt; k} \proj{b_i} a_k}{b_j} = \inner{a_k - \proj{b_j} a_k}{b_j} = 0$ for all 

$j &lt; k$, which shows that 

$\set{b_j}_{j \leq k}$ is orthogonal. Moreover, if 

$\mathcal{A}_{k-1} = \mathcal{B}_{k-1}$, then 

$b_k \in a_k - \mathcal{B}_{k-1} = a_k - \mathcal{A}_{k-1} \subseteq \mathcal{A}_k$ and 

$a_k \in b_k + \mathcal{B}_{k-1} \subseteq \mathcal{B}_k$, so 

$\mathcal{A}_k = \mathcal{B}_k$.&lt;/p&gt;
&lt;p&gt;To compute a QR factorization of 

$A$, we can apply Gram–Schmidt orthogonalization to the columns of 

$A =: \begin{bmatrix} a_1 &amp; \cdots &amp; a_n \end{bmatrix}$ as follows. For each 

$j \leq m$, we inductively define 

$b_j := a_j - \sum_{i &lt; j} \proj{q_i} a_j$ if 

$j \leq n$ &lt;em&gt;and&lt;/em&gt; the right-hand expression is &lt;em&gt;nonzero&lt;/em&gt;; otherwise, we select an &lt;em&gt;arbitrary nonzero&lt;/em&gt; 

$b_j \in \mathcal{B}_{j-1}^\perp$. In either case, we then define 

$q_j := \frac{b_j}{\norm{b_j}}$. We thereby obtain an orthonormal basis 

$\set{q_j}_{j \leq m}$ of 

$\C^m$ such that 

$a_j = \sum_{i \leq \min \set{j,\,m}} r_{ij} q_i$ for some 

$r_{ij} \in \C$, as required.&lt;/p&gt;
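&lt;p&gt;A sketch of this procedure in NumPy for the full-column-rank case (the function name is mine; the arbitrary-completion branch for the remaining columns of $Q$ is omitted, so only the thin factorization is produced):&lt;/p&gt;

```python
import numpy as np

# Classical Gram-Schmidt for the thin QR factorization of a matrix with
# full column rank: b_j is a_j minus its projections onto the previous q_i.
def gram_schmidt_qr(A):
    m, n = A.shape
    Q = np.zeros((m, n), dtype=complex)
    R = np.zeros((n, n), dtype=complex)
    for j in range(n):
        v = A[:, j].astype(complex)
        for i in range(j):
            R[i, j] = Q[:, i].conj() @ A[:, j]   # r_ij = inner(a_j, q_i)
            v = v - R[i, j] * Q[:, i]
        R[j, j] = np.linalg.norm(v)              # nonzero by full column rank
        Q[:, j] = v / R[j, j]
    return Q, R

rng = np.random.default_rng(4)
A = rng.standard_normal((6, 4))
Q, R = gram_schmidt_qr(A)
assert np.allclose(Q @ R, A)
assert np.allclose(Q.conj().T @ Q, np.eye(4))
assert np.allclose(R, np.triu(R))
```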
&lt;h3 id=&#34;modified-gramschmidt-orthogonalization&#34;&gt;Modified Gram–Schmidt orthogonalization&lt;/h3&gt;
&lt;p&gt;In Gram–Schmidt orthogonalization, we define 

$b_j = (I - \sum_{i &lt; j} \proj{b_i}) a_j$. Since the 

$b_i$ are orthogonal, this can equivalently be written as 

$b_j = (I - \proj{b_{j-1}}) \cdots (I - \proj{b_2}) (I - \proj{b_1}) a_j$, so computationally speaking, the projection operator 

$I - \proj{b_i}$ can be applied to all 

$a_j$ with 

$i &lt; j$ (assuming there are finitely many of them) as soon as 

$b_i$ is generated. The resulting algorithm is known as &lt;strong&gt;modified Gram–Schmidt orthogonalization&lt;/strong&gt; and exhibits greater numerical stability than “classical” Gram–Schmidt orthogonalization.&lt;/p&gt;
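&lt;p&gt;A NumPy sketch of modified Gram–Schmidt (the function name is mine), tried on a nearly rank-deficient matrix of Läuchli type; on this example the computed $Q$ remains numerically orthogonal, whereas classical Gram–Schmidt would lose orthogonality badly:&lt;/p&gt;

```python
import numpy as np

# Modified Gram-Schmidt: apply I - proj_{q_i} to all remaining columns as
# soon as q_i is available. Mathematically equivalent to classical
# Gram-Schmidt, but numerically more stable.
def mgs_qr(A):
    m, n = A.shape
    V = A.astype(float)
    Q = np.zeros((m, n))
    R = np.zeros((n, n))
    for i in range(n):
        R[i, i] = np.linalg.norm(V[:, i])
        Q[:, i] = V[:, i] / R[i, i]
        for j in range(i + 1, n):
            R[i, j] = Q[:, i] @ V[:, j]
            V[:, j] = V[:, j] - R[i, j] * Q[:, i]   # (I - proj_{q_i}) applied now
    return Q, R

# Nearly linearly dependent columns (Lauchli-type test matrix):
eps = 1e-8
A = np.array([[1.0, 1.0, 1.0],
              [eps, 0.0, 0.0],
              [0.0, eps, 0.0],
              [0.0, 0.0, eps]])
Q, R = mgs_qr(A)
assert np.allclose(Q @ R, A)
ortho_err = np.linalg.norm(Q.T @ Q - np.eye(3))
assert 1e-6 > ortho_err          # orthogonality survives at roughly kappa(A)*u
```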
&lt;h2 id=&#34;householder-reflections&#34;&gt;Householder reflections&lt;/h2&gt;
&lt;p&gt;Suppose that 

$v$ is a nonzero vector in a Hilbert space 

$V$. The &lt;strong&gt;reflection operator&lt;/strong&gt; across the hyperplane 

$\set{v}^\perp$ is defined for 

$x \in V$ by


$$
\refl{v} x := (I - 2\proj{v}) \, x = x - \frac{2\inner{x}{v}}{\inner{v}{v}} v.
$$
Since 

$\proj{v}$ is idempotent and self-adjoint, 

$\refl{v}$ is involutory and self-adjoint and therefore unitary.&lt;/p&gt;
&lt;p&gt;A &lt;strong&gt;Householder reflection&lt;/strong&gt; is a reflection operator 

$H : \C^d \to \C^d$ that zeroes out all components of some vector 

$x$ except for its first component 

$x_1$; we assume that the other components are not already all zeroes. In other words, 

$Hx = \alpha e_1$ for some 

$\alpha \in \C$, where 

$e_1 := \begin{bmatrix} 1 &amp; 0 &amp; \cdots &amp; 0 \end{bmatrix}^\tp$ and 

$x \notin \span \, \set{e_1}$.&lt;/p&gt;
&lt;p&gt;As 

$H$ is unitary and self-adjoint, we must have 

$\abs{\alpha} = \norm{x}$ and 

$\inner{Hx}{x} = \alpha \conj{x_1} \in \R$, which implies that 

$\alpha = \pm \sign(x_1) \norm{x}$ (unless 

$x_1 = 0$, in which case the only constraint is 

$\abs{\alpha} = \norm{x}$). Since 

$\refl{w} x = \alpha e_1$ if and only if 

$\frac{2\inner{x}{w}}{\inner{w}{w}} w = x - \alpha e_1$, using the &lt;strong&gt;Householder vector&lt;/strong&gt; 

$v := x - \alpha e_1$ guarantees that 

$H := \refl{v}$ satisfies 

$Hx = \alpha e_1$. A conventional choice of 

$\alpha$ in this context is 

$\alpha = -\sign(x_1) \norm{x}$ so as to maximize 

$\norm{v}^2 = 2(\norm{x}^2 \mp \abs{x_1} \norm{x})$ for the sake of numerical stability.&lt;/p&gt;
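&lt;p&gt;A NumPy sketch of the Householder vector and the resulting reflection (function names are mine; real data for simplicity):&lt;/p&gt;

```python
import numpy as np

# Householder reflection sending x to alpha e_1, with the conventional
# choice alpha = -sign(x_1) ||x|| so that v = x - alpha e_1 is large.
def householder_vector(x):
    s = np.sign(x[0]) if x[0] != 0 else 1.0   # any sign works when x_1 = 0
    alpha = -s * np.linalg.norm(x)
    v = x.astype(float)
    v[0] = v[0] - alpha
    return v, alpha

def apply_reflection(v, x):
    # refl_v(x) = x - 2 inner(x, v) / inner(v, v) * v
    return x - 2 * (x @ v) / (v @ v) * v

x = np.array([3.0, 4.0, 0.0, 12.0])
v, alpha = householder_vector(x)
y = apply_reflection(v, x)
assert np.isclose(abs(alpha), np.linalg.norm(x))   # |alpha| = ||x|| = 13
assert np.allclose(y, [alpha, 0.0, 0.0, 0.0])      # all but the first entry zeroed
```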
&lt;p&gt;To compute a QR factorization of 

$A$, we can apply Householder reflections successively to introduce zeroes below the diagonal in each column of 

$A$. More precisely, we can find a Householder reflection 

$H \in \C^{m \times m}$ such that


$$
HA = \begin{bmatrix} \alpha &amp; b^\tp \\ &amp; A&#39; \end{bmatrix},
$$
where 

$\alpha \in \C$, 

$b \in \C^{n-1}$, and 

$A&#39; \in \C^{(m-1) \times (n-1)}$ (allowing 

$H = I$ if the subdiagonal entries in the first column of 

$A$ are already zero). Now supposing inductively that 

$A&#39;$ has a QR factorization 

$Q’ R’$, we obtain the factorization


$$
A = 
\underbrace{H^* \begin{bmatrix} 1 &amp; \\ &amp; Q&#39; \end{bmatrix}}_{Q} 
\underbrace{\begin{bmatrix} \alpha &amp; b^\tp \\ &amp; R&#39; \end{bmatrix}}_{R}.
$$&lt;/p&gt;
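&lt;p&gt;The recursion above can be carried out iteratively; a NumPy sketch for real matrices (the function name is mine):&lt;/p&gt;

```python
import numpy as np

# Householder QR: successively reflect to zero out the subdiagonal of each
# column, accumulating Q as a product of reflections (real case, so H* = H).
def householder_qr(A):
    m, n = A.shape
    R = A.astype(float)
    Q = np.eye(m)
    for k in range(min(m, n)):
        x = R[k:, k]
        if np.allclose(x[1:], 0):
            continue                    # subdiagonal already zero: take H = I
        s = np.sign(x[0]) if x[0] != 0 else 1.0
        alpha = -s * np.linalg.norm(x)
        v = x.copy()
        v[0] = v[0] - alpha
        v = v / np.linalg.norm(v)       # unit Householder vector
        R[k:, k:] = R[k:, k:] - 2 * np.outer(v, v @ R[k:, k:])
        Q[:, k:] = Q[:, k:] - 2 * np.outer(Q[:, k:] @ v, v)
    return Q, R

rng = np.random.default_rng(5)
A = rng.standard_normal((5, 3))
Q, R = householder_qr(A)
assert np.allclose(Q @ R, A)
assert np.allclose(Q.T @ Q, np.eye(5))
assert np.allclose(np.tril(R, -1), 0)
```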
&lt;h2 id=&#34;givens-rotations&#34;&gt;Givens rotations&lt;/h2&gt;
&lt;p&gt;Given 

$a, b \in \C$, consider the problem of finding a 

$U \in \mathrm{SU}(2)$ and an 

$r \in \C$ such that 

$U \begin{bmatrix} a \\ b \end{bmatrix} = \begin{bmatrix} r \\ 0 \end{bmatrix}$. We have


$$
U =
\begin{bmatrix} 
c &amp; s \\
-\conj{s} &amp; \conj{c}
\end{bmatrix},
\quad
\text{where 

$\abs{c}^2 + \abs{s}^2 = 1$}
$$
and 

$ac + bs = r$, 

$b\conj{c} - a\conj{s} = 0$. Since 

$U$ is unitary, we must have 

$r = \omega \sqrt{\abs{a}^2 + \abs{b}^2}$ for some 

$\omega \in \C$ with 

$\abs{\omega} = 1$, and assuming that 

$r \neq 0$ (which is to say that 

$a$ and 

$b$ are not both zero), we obtain


$$
c = \frac{\conj{a}}{\conj{r} \vphantom{\sqrt{\abs{a}^2 + \abs{b}^2}}} = \frac{\omega \conj{a}}{\sqrt{\abs{a}^2 + \abs{b}^2}}, \quad
s = \frac{\conj{b}}{\conj{r} \vphantom{\sqrt{\abs{a}^2 + \abs{b}^2}}} = \frac{\omega \conj{b}}{\sqrt{\abs{a}^2 + \abs{b}^2}}.
$$
A conventional choice in this context is 

$\omega = \sign(a)$, along with 

$U = I$ (and 

$r = 0$) in the case 

$a = b = 0$.&lt;/p&gt;
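&lt;p&gt;A sketch of these formulas in plain Python (standard library only; the helper names are mine):&lt;/p&gt;

```python
import cmath
import math

# Givens coefficients (c, s, r) for complex a, b, with omega = sign(a)
# and the convention U = I, r = 0 in the case a = b = 0.
def csign(z):
    return z / abs(z) if z != 0 else 1.0   # complex sign, sign(0) := 1

def givens(a, b):
    if a == 0 and b == 0:
        return 1.0, 0.0, 0.0
    h = math.hypot(abs(a), abs(b))         # sqrt(|a|^2 + |b|^2)
    omega = csign(a)
    c = omega * a.conjugate() / h
    s = omega * b.conjugate() / h
    return c, s, omega * h

a, b = 3 + 4j, 5j
c, s, r = givens(a, b)
assert cmath.isclose(c * a + s * b, r)                 # first row maps (a, b) to r
assert math.isclose(abs(b * c.conjugate() - a * s.conjugate()), 0.0, abs_tol=1e-12)
assert math.isclose(abs(c) ** 2 + abs(s) ** 2, 1.0)    # unitarity
```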
&lt;p&gt;Thus, if 

$a$ and 

$b$ are the 

$i$&lt;sup&gt;th&lt;/sup&gt; and 

$j$&lt;sup&gt;th&lt;/sup&gt; components of some 

$x \in \C^m$, where 

$i &lt; j$, the &lt;strong&gt;Givens rotation&lt;/strong&gt;


$$
G :=
\begin{bmatrix}
I_{i-1} \\
&amp; c &amp; &amp; s \\
&amp; &amp; I_{(j-1)-i} &amp; \\
&amp; -\conj{s} &amp; &amp; \conj{c} \\
&amp; &amp; &amp; &amp; I_{m-j}
\end{bmatrix}
$$
is a unitary matrix such that the 

$j$&lt;sup&gt;th&lt;/sup&gt; component of 

$Gx$ is zero. (In the real-valued setting, 

$G$ is indeed a rotation in the 

$x_i$-

$x_j$ plane.) Such rotations can evidently be applied to compute a QR factorization of 

$A$ by introducing zeroes below the diagonal of 

$A$ one at a time.&lt;/p&gt;
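&lt;p&gt;A NumPy sketch of a Givens-based QR factorization for real matrices (the function name is mine; each rotation touches only two rows):&lt;/p&gt;

```python
import numpy as np

# QR by Givens rotations (real case): zero out each subdiagonal entry by a
# rotation in the plane of rows j and i; Q accumulates the transposes.
def givens_qr(A):
    m, n = A.shape
    R = A.astype(float)
    Q = np.eye(m)
    for j in range(n):                      # column index
        for i in range(m - 1, j, -1):       # zero R[i, j] against pivot R[j, j]
            a, b = R[j, j], R[i, j]
            if b == 0:
                continue
            h = np.hypot(a, b)
            c, s = a / h, b / h
            G = np.array([[c, s], [-s, c]])
            R[[j, i], :] = G @ R[[j, i], :]
            Q[:, [j, i]] = Q[:, [j, i]] @ G.T
    return Q, R

rng = np.random.default_rng(6)
A = rng.standard_normal((4, 3))
Q, R = givens_qr(A)
assert np.allclose(Q @ R, A)
assert np.allclose(Q.T @ Q, np.eye(4))
assert np.allclose(np.tril(R, -1), 0)
```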
&lt;div class=&#34;footnotes&#34; role=&#34;doc-endnotes&#34;&gt;
&lt;hr&gt;
&lt;ol&gt;
&lt;li id=&#34;fn:1&#34;&gt;
&lt;p&gt;If 

$A \in \R^{m \times n}$, a QR factorization is defined analogously; i.e., with 

$Q$ orthogonal.&amp;#160;&lt;a href=&#34;#fnref:1&#34; class=&#34;footnote-backref&#34; role=&#34;doc-backlink&#34;&gt;&amp;#x21a9;&amp;#xfe0e;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;/div&gt;
</description>
    </item>
    
  </channel>
</rss>
