MIT A 2020 Vision of Linear Algebra (2/2)
Part 3: Orthogonal Vectors
Orthogonal means perpendicular.
Orthogonal matrices $Q$: every column $q_i$ of $Q$ is perpendicular to every other column, and each column is a unit vector (orthonormal columns).
\[Q^TQ=\begin{bmatrix} -&q_1^T&-\\&\vdots&\\-&q_n^T&- \end{bmatrix}\begin{bmatrix} &&\\q_1&\cdots&q_n\\&& \end{bmatrix}=\begin{bmatrix} 1&&&0\\&1&&\\&&\ddots&\\0&&&1 \end{bmatrix}=I_n\] \[QQ^T=\begin{bmatrix} &&\\q_1&\cdots&q_n\\&& \end{bmatrix}\begin{bmatrix} -&q_1^T&-\\&\vdots&\\-&q_n^T&- \end{bmatrix}=q_1q_1^T+\cdots+q_nq_n^T=\textbf{I}\]These matrices are the best because they don't change the length of anything. Nothing blows up and nothing goes to zero. You can multiply together 1,000 orthogonal matrices, and you still have an orthogonal matrix.
\[\Vert Qx\Vert^2=x^TQ^TQx=x^TQ^{-1}Qx=x^Tx=\Vert x\Vert^2\]Length is preserved!
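A quick numerical check (a minimal NumPy sketch; the matrix and vector here are random, purely for illustration): build an orthogonal $Q$ from the QR factorization of a random matrix and confirm that multiplying by $Q$ preserves length.

```python
import numpy as np

rng = np.random.default_rng(0)

# The Q factor of a random square matrix is orthogonal.
Q, _ = np.linalg.qr(rng.standard_normal((5, 5)))

x = rng.standard_normal(5)

print(np.allclose(Q.T @ Q, np.eye(5)))            # Q^T Q = I
print(np.linalg.norm(Q @ x), np.linalg.norm(x))   # same length
```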
Eigenvalues of $Q$.
\[\begin{matrix} Qx=\lambda x,&\Vert Qx\Vert^2=\vert\lambda\vert^2\Vert x\Vert^2,&\vert\lambda\vert^2=1 \end{matrix}\]If we have a set of independent columns that are not orthogonal, we would often like to convert them to orthonormal columns like this:
\[A=QR\\ \begin{bmatrix} &&\\ a_1&...&a_n\\&& \end{bmatrix}=\begin{bmatrix} &&\\ q_1&...&q_n\\&& \end{bmatrix}\begin{bmatrix} r_{11}&r_{12}&.&r_{1n}\\ &r_{22}&.&r_{2n}\\ & &.&.\\&&&r_{nn} \end{bmatrix}\]- Columns $a_1$ to $a_n$ are independent;
- Columns $q_1$ to $q_n$ are orthonormal.
What's the difference between the word "orthogonal" and "orthonormal"? Two vectors are orthogonal if their inner product is zero. They are orthonormal if they are orthogonal, and additionally each vector has norm 1.
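A small sketch of this conversion using NumPy's built-in QR factorization (the matrix below is a random illustration):

```python
import numpy as np

rng = np.random.default_rng(1)

# A with independent (but not orthogonal) columns.
A = rng.standard_normal((6, 3))

# Reduced QR: Q has orthonormal columns, R is upper triangular.
Q, R = np.linalg.qr(A)

print(np.allclose(Q.T @ Q, np.eye(3)))  # orthonormal columns
print(np.allclose(Q @ R, A))            # A = QR
print(np.allclose(R, np.triu(R)))       # R is upper triangular
```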
Least Squares: A Major Application of $A=QR$
$m>n$, $m$ equations $Ax=b$, $n$ unknowns, minimize $\Vert b-Ax\Vert^2=\Vert e\Vert^2$.
Normal equations for the best $\hat{x}$: $A^Te=0$ or $A^TA\hat{x}=A^Tb$.
We put a little hat on that $x$ to show that it doesn't solve the original equation $Ax=b$, but it comes the closest. It's the closest solution we can find.
$A^TA$ (A transpose A) is a terrifically important matrix. It's a square matrix ($A$ didn't have to be square) and also symmetric. $A^TA$ is a great matrix for theory, and $QR$ makes it work in practice.
If $A=QR$ then $R^TQ^TQR\hat{x}=R^TQ^Tb$ leads to $R\hat{x}=Q^Tb$.
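A minimal NumPy sketch of least squares through $QR$ (the data is made up for illustration): factor $A=QR$, then solve $R\hat{x}=Q^Tb$.

```python
import numpy as np

rng = np.random.default_rng(2)

# Overdetermined system: m = 20 equations, n = 3 unknowns.
A = rng.standard_normal((20, 3))
b = rng.standard_normal(20)

Q, R = np.linalg.qr(A)                 # reduced QR: Q is 20x3, R is 3x3
x_hat = np.linalg.solve(R, Q.T @ b)    # solve R x_hat = Q^T b

# Same answer as the normal equations A^T A x_hat = A^T b (and as lstsq).
print(np.allclose(x_hat, np.linalg.lstsq(A, b, rcond=None)[0]))
```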
Part 4: Eigenvalues and Eigenvectors
$S=S^T$ has orthogonal eigenvectors: $x^Ty=0$ when the eigenvalues differ. The proof is important.
Start from these facts:
\[\begin{matrix} Sx=\lambda x,&Sy=\alpha y,&\lambda\ne\alpha,&S^T=S \end{matrix}\]- Transpose $Sx=\lambda x$ to $x^TS^T=\lambda x^T$, use $S^T=S$, and multiply on the right by $y$. $\Rightarrow x^TSy=\lambda x^Ty$
- We can also multiply $Sy=\alpha y$ by $x^T$. $\Rightarrow x^TSy=\alpha x^Ty$
- Now $\lambda x^Ty=\alpha x^Ty$. Since $\lambda\ne\alpha$, $x^Ty$ must be zero.
Eigenvectors of $S$ go into the Orthogonal Matrix $Q$
\[S\begin{bmatrix} &&\\ q_1&..&q_n\\&& \end{bmatrix}=\begin{bmatrix} &&\\ \lambda_1q_1&..&\lambda_nq_n\\&& \end{bmatrix}=\begin{bmatrix} &&\\ q_1&..&q_n\\&& \end{bmatrix}\begin{bmatrix} \lambda_1&&\\ &..&\\&&\lambda_n \end{bmatrix}\]That says
\[\begin{matrix} SQ=Q\Lambda&\Rightarrow&S=Q\Lambda Q^{-1}=Q\Lambda Q^T \end{matrix}\]- $S=Q\Lambda Q^T$ is a sum $\sum_{i=1}^{n}\lambda_iq_iq_i^T$ of rank one matrices.
- With $S=A^TA$ this will lead to the singular values of $A$.
- $A=U\Sigma V^T$ is a sum $\sum_{i=1}^n\sigma_iu_iv_i^T$ of rank one matrices.
- Singular values $\sigma_1$ to $\sigma_r$ in $\Sigma$. Singular vectors in $U$ and $V$.
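A short NumPy check of the spectral decomposition $S=Q\Lambda Q^T$ as a sum of rank-one pieces (the symmetric matrix below is a random illustration):

```python
import numpy as np

rng = np.random.default_rng(3)

B = rng.standard_normal((4, 4))
S = B + B.T                        # a symmetric matrix

lam, Q = np.linalg.eigh(S)         # eigenvalues and eigenvectors of symmetric S

print(np.allclose(Q.T @ Q, np.eye(4)))          # Q is orthogonal
print(np.allclose(S, Q @ np.diag(lam) @ Q.T))   # S = Q Lambda Q^T

# Sum of rank-one matrices lambda_i q_i q_i^T rebuilds S.
S_sum = sum(lam[i] * np.outer(Q[:, i], Q[:, i]) for i in range(4))
print(np.allclose(S, S_sum))
```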
Eigenvalues and Eigenvectors of $A$: Not symmetric
\[\begin{matrix} A\begin{bmatrix} x_1&..&x_n \end{bmatrix}=\begin{bmatrix} \lambda_1x_1&..&\lambda_nx_n \end{bmatrix},&AX=X\Lambda \end{matrix}\]With $n$ independent eigenvectors, $A=X\Lambda X^{-1}$.
$A^2,A^3,..$ have the same eigenvectors as $A$.
\[\begin{matrix} A^n\rightarrow 0&\text{when}&\Lambda^n\rightarrow 0:&\textbf{All}~~\vert\lambda_i\vert<1 \end{matrix}\]$A^TA$ is square, symmetric, nonnegative definite
- Square. $A^TA=(n\times m)(m\times n)=n\times n$
- Symmetric. $(BA)^T=A^TB^T$, so $(A^TA)^T=A^TA^{TT}=A^TA$
- $S=S^T$ is nonnegative definite if it passes either (equivalent) test:
- EIGENVALUE TEST 1: All eigenvalues $\ge 0$, $Sx=\lambda x$;
- ENERGY TEST 2: $x^TSx\ge 0$ for every vector $x$.
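A quick NumPy check that $A^TA$ passes both tests (random rectangular $A$, purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(4)

A = rng.standard_normal((7, 3))   # rectangular A, m > n
S = A.T @ A                       # n x n, symmetric

print(np.allclose(S, S.T))                       # symmetric
print(np.all(np.linalg.eigvalsh(S) >= -1e-12))   # test 1: eigenvalues >= 0

x = rng.standard_normal(3)
print(x @ S @ x >= 0)                            # test 2: energy x^T S x = ||Ax||^2 >= 0
```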
The matrices we see in data are rectangular, and eigenvalues don't make sense for them. SINGULAR VALUES take the place of eigenvalues.
Part 5: Singular Values and Singular Vectors
Singular value decomposition (SVD)
$A=U\Sigma V^T$ with $U^TU=I$ and $V^TV=I$.
$AV=U\Sigma$ means
\[\begin{matrix} A\begin{bmatrix} &&\\ v_1&..&v_n\\ && \end{bmatrix}=\begin{bmatrix} &&\\ u_1&..&u_n\\ && \end{bmatrix}\begin{bmatrix} \sigma_1&&\\ &..&\\ &&\sigma_n \end{bmatrix}&\text{and}&Av_i=\sigma_iu_i \end{matrix}\]SINGULAR VALUES $\sigma_1\ge\sigma_2\ge\cdots\ge\sigma_r>0$, $r = \text{rank of } A$
How to choose orthonormal $v_i$ in the row space of $A$? The $v_i$ are eigenvectors of $A^TA$,
\[A^TAv_i=\lambda_iv_i=\sigma_i^2v_i\]The $v_i$ are orthonormal. ($V^TV=I$)
How to choose $u_i$ in the column space?
\[u_i=\frac{Av_i}{\sigma_i}\]The $u_i$ are orthonormal. (The computation below is the key step in showing $U^TU=I$.)
\[\left(\frac{Av_i}{\sigma_i}\right)^T\left(\frac{Av_j}{\sigma_j}\right)=\frac{v_i^TA^TAv_j}{\sigma_i\sigma_j}=\frac{\sigma_j^2\,v_i^Tv_j}{\sigma_i\sigma_j}=\begin{cases} 1&i=j\\0&i\ne j \end{cases}\]Full size SVD. $A=U\Sigma V^T$: $(m\times n)=(m\times m)(m\times n)(n\times n)$
- $u_{r+1}$ to $u_m$: Nullspace of $A^T$
- $v_{r+1}$ to $v_n$: Nullspace of $A$
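A minimal NumPy sketch of the full SVD (a random rectangular matrix for illustration), checking $Av_i=\sigma_iu_i$:

```python
import numpy as np

rng = np.random.default_rng(5)

A = rng.standard_normal((5, 3))           # m x n, m > n

U, s, Vt = np.linalg.svd(A)               # full SVD: U is 5x5, Vt is 3x3
V = Vt.T

print(np.allclose(U.T @ U, np.eye(5)))    # U^T U = I
print(np.allclose(V.T @ V, np.eye(3)))    # V^T V = I

# A v_i = sigma_i u_i for each singular pair.
for i in range(len(s)):
    print(np.allclose(A @ V[:, i], s[i] * U[:, i]))
```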
Low rank approximation to a big matrix
- Start from the SVD: $A=U\Sigma V^T=\sum_{i=1}^{r}\sigma_iu_iv_i^T$
- Keep the $k$ largest $\sigma_1$ to $\sigma_k$: $A_k=\sum_{i=1}^{k}\sigma_iu_iv_i^T$
- $A_k$ is the closest rank $k$ matrix to $A$: $\Vert A-A_k\Vert\le\Vert A-B_k\Vert$
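A sketch of the rank-$k$ truncation in NumPy (random matrix, $k=2$ chosen arbitrarily):

```python
import numpy as np

rng = np.random.default_rng(6)

A = rng.standard_normal((8, 6))
U, s, Vt = np.linalg.svd(A, full_matrices=False)

k = 2
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]   # keep the k largest sigma_i

# By Eckart-Young, the error in the spectral norm is sigma_{k+1}.
print(np.linalg.norm(A - A_k, 2), s[k])
```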
Norms.
\[\begin{matrix} \Vert A\Vert=\sigma_{\max},&\Vert A\Vert_F=\sqrt{\sigma_1^2+...+\sigma_r^2},&\Vert A\Vert_N=\sigma_1+...+\sigma_r \end{matrix}\]Randomized Numerical Linear Algebra
For very large matrices, randomization has brought a revolution.
Example: Multiply $AB$ with Column-row sampling $(AS)(S^TB)$
\[\begin{matrix} AS=\begin{bmatrix} &&\\ a_1&a_2&a_3\\&& \end{bmatrix}\begin{bmatrix} s_{11}&0\\0&0\\0&s_{32} \end{bmatrix}=\begin{bmatrix} &\\ s_{11}a_1&s_{32}a_3\\& \end{bmatrix}&\text{and}&S^TB=\begin{bmatrix} s_{11}b_1^T\\s_{32}b_3^T \end{bmatrix} \end{matrix}\]NOTICE. $SS^T$ is not close to $I$. But we can have
\[\begin{matrix} E[SS^T]=I,&E[(AS)(S^TB)]=AB \end{matrix}\]Norm-squared sampling. Choose column-row pair $i$ with probability proportional to $\Vert a_i\Vert\Vert b_i^T\Vert$.
This choice minimizes the sampling variance.
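A small NumPy sketch of column-row sampling with norm-squared probabilities (the $1/(k\,p_i)$ scaling is the standard choice that makes the estimate unbiased; the matrix sizes and the number of samples are illustrative):

```python
import numpy as np

rng = np.random.default_rng(7)

A = rng.standard_normal((50, 200))
B = rng.standard_normal((200, 40))

# Norm-squared sampling probabilities for the column-row pairs.
p = np.linalg.norm(A, axis=0) * np.linalg.norm(B, axis=1)
p /= p.sum()

k = 60                                   # number of sampled pairs
idx = rng.choice(A.shape[1], size=k, p=p)

# Scale each sampled rank-one piece by 1/(k p_i) so the expectation is AB.
approx = sum(np.outer(A[:, i], B[i, :]) / (k * p[i]) for i in idx)

err = np.linalg.norm(A @ B - approx) / np.linalg.norm(A @ B)
print(f"relative Frobenius error: {err:.3f}")
```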
If you want to watch the videos, click MIT A 2020 Vision of Linear Algebra, Spring 2020.