StatQuest (ML) Linear Models

Fitting a line to data, aka Least Squares, aka Linear Regression.

We can measure how well this line fits the data by seeing how close it is to the data points.

Back in the day, when people were first working this out, they probably tried taking the absolute value of each term and then discovered that it made the math pretty tricky. So they ended up squaring each term instead. Squaring ensures that each term is positive.

And it is called the "sum of squared residuals", because the residuals are the differences between the real data and the line, and we are summing the square of these values.

The generic line equation is: y = a*x + b. a is the slope of the line, and b is the intercept.

All we want to do is find the optimal values for a and b that minimize the sum of squared residuals.

In more general math terms, $Sum\ of\ Squared\ Residuals=\sum_i (a x_i + b - y_i)^2$.

We can use a 3-D graph to show how different values for the slope and intercept result in different sums of squared residuals. But no one ever solved this problem by hand; it is done on a computer.
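
Here is a minimal sketch (Python with NumPy; the mouse weight/size numbers are made up) of how a computer can find the slope and intercept that minimize the sum of squared residuals, first by brute force over a grid, and then with the closed-form least-squares fit that software actually uses:

```python
import numpy as np

# Hypothetical mouse data: weight (x) and size (y)
x = np.array([0.9, 1.6, 2.3, 3.5, 4.2, 5.1, 5.5, 5.6])
y = np.array([1.4, 2.6, 1.0, 3.7, 5.5, 4.7, 6.4, 6.2])

def ssr(a, b):
    """Sum of squared residuals for the line y = a*x + b."""
    return np.sum((a * x + b - y) ** 2)

# Brute-force search over a grid of slopes and intercepts
slopes = np.linspace(-2, 3, 501)
intercepts = np.linspace(-3, 3, 601)
best = min((ssr(a, b), a, b) for a in slopes for b in intercepts)
print("grid search:   SSR=%.3f  a=%.2f  b=%.2f" % best)

# The closed-form least-squares solution (what software actually uses)
a_hat, b_hat = np.polyfit(x, y, deg=1)
print("least squares: SSR=%.3f  a=%.2f  b=%.2f" % (ssr(a_hat, b_hat), a_hat, b_hat))
```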

Linear Regression

aka General Linear Models

Three most important concepts behind Linear Regression:

  1. use least-squares to fit a line to the data
  2. calculate $R^2$
  3. calculate a p-value for $R^2$

\[SS(mean)=\sum(data-mean)^2 \\ Var(mean) = \frac{SS(mean)}{n},\ (n\ is\ the\ sample\ size)\]

SS(mean): Sum of squares around the mean;

Var(mean): Variance around the mean.

Using the same SS() and Var() notation, we can also define $SS(fit)$ and $Var(fit)$ for the fitted line.

\[SS(fit) = \sum(data - line)^2\\ Var(fit) = \frac{SS(fit)}{n}\]

$R^2$ tells us how much of the variation in one variable (e.g. mouse size) can be explained by taking another variable (e.g. mouse weight) into account.

\[R^2 = \frac{Var(mean)-Var(fit)}{Var(mean)}\]

If $R^2=60\%$, that means there is a 60% reduction in variance when we take the mouse weight into account. Alternatively, we can say that mouse weight "explains" 60% of the variation in mouse size.

We can also use the sums of squares to make the same calculation.

\[R^2 = \frac{SS(mean)-SS(fit)}{SS(mean)}\]

If $R^2=100\%$, that means mouse weight "explains" 100% of the variation in mouse size.
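
A minimal sketch (same made-up data and fitted line as above) of the SS and Var calculations, showing that the two $R^2$ formulas give the same answer:

```python
import numpy as np

x = np.array([0.9, 1.6, 2.3, 3.5, 4.2, 5.1, 5.5, 5.6])
y = np.array([1.4, 2.6, 1.0, 3.7, 5.5, 4.7, 6.4, 6.2])
n = len(y)

# Fit the least-squares line
a, b = np.polyfit(x, y, deg=1)

ss_mean = np.sum((y - y.mean()) ** 2)     # SS(mean): squared residuals around the mean
ss_fit = np.sum((y - (a * x + b)) ** 2)   # SS(fit): squared residuals around the line
var_mean, var_fit = ss_mean / n, ss_fit / n

# Both formulas give the same R^2
r2_from_var = (var_mean - var_fit) / var_mean
r2_from_ss = (ss_mean - ss_fit) / ss_mean
print(r2_from_var, r2_from_ss)
```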

If we want to predict the mouse body length by using tail length and mouse weight, we need a 3-D space. In this 3-D space, if the z-axis is useless and doesn't make $SS(fit)$ smaller, then least-squares will ignore it by making that parameter = 0.

because in this case, plugging the tail length into the equation would have no effect on predicting the mouse size

This means equations with more parameters will never make $SS(fit)$ worse than equations with fewer parameters.

this is because least squares will cause any term that makes SS(fit) worse to be multiplied by 0, and, in a sense, no longer exist.

The more parameters we add to the equation, the more opportunities we have for random events to reduce $SS(fit)$ and result in a better $R^2$. Thus, people report an "adjusted $R^2$" value that, in essence, scales $R^2$ by the number of parameters.
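
For reference, the usual definition of adjusted $R^2$ (not spelled out in the video; written here with the document's $p_{fit}$, which counts the intercept) is:

\[R^2_{adjusted} = 1-\frac{(1-R^2)(n-1)}{n-p_{fit}}\]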

If there are only two data points, there is always a line connecting them, making $SS(fit) = 0$ and $R^2 = 100\%$. In this situation, $R^2$ cannot be used on its own to evaluate the fitted line or plane, so we also need the $p-value$.

p-value

The p-value for $R^2$ comes from something called "F".

\[F = \frac{The\ variation\ in\ mouse\ size\ explained\ by\ weight}{The\ variation\ in\ mouse\ size\ not\ explained\ by\ weight}\]

\[R^2 = \frac{The\ variation\ in\ mouse\ size\ explained\ by\ weight}{Variation\ in\ mouse\ size\ without\ taking\ weight\ into\ account}\]

\[R^2=\frac{SS(mean)-SS(fit)}{SS(mean)}\\ F = \frac{(SS(mean)-SS(fit))/(p_{fit}-p_{mean})}{SS(fit)/(n-p_{fit})}\]

This $F$ equation tells us whether $R^2$ is significant. $p_{fit}-p_{mean}$ and $n-p_{fit}$ are the "degrees of freedom"; they turn the sums of squares into variances.

$p_{fit}$ is the number of parameters in the fit line.

for example, if $y = intercept + slope \times x$, then $p_{fit}=2$.

$p_{mean}$ is the number of parameters in the mean line.

for example, if $y = intercept$, then $p_{mean}=1$.

Why divide SS(fit) by $n-p_{fit}$ instead of just n? Intuitively, the more parameters you have in your equation, the more data you need to estimate them. For example, you only need two points to estimate a line, but you need three points to estimate a plane.

If the fit is good, most of the variation is explained and little is left unexplained, so:

\[\begin{aligned} F & = \frac{The\ variation\ in\ mouse\ size\ explained\ by\ weight}{The\ variation\ in\ mouse\ size\ not\ explained\ by\ weight} \\ & = \frac{large\ number}{small\ number}\\ & = really\ large\ number \end{aligned}\]

How do we get a p-value from $F$?

Conceptually, the $p-value$ is the number of more extreme $F$-values (from random datasets) divided by the total number of $F$-values.

In practice, rather than generating tons of random datasets, people use the F-distribution (the smooth curve that approximates that histogram of F-values) to calculate the p-value.
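
A sketch of that calculation with SciPy (same made-up data as above; `f.sf` gives the area under the F-distribution to the right of our F-value):

```python
import numpy as np
from scipy.stats import f

x = np.array([0.9, 1.6, 2.3, 3.5, 4.2, 5.1, 5.5, 5.6])
y = np.array([1.4, 2.6, 1.0, 3.7, 5.5, 4.7, 6.4, 6.2])
n, p_fit, p_mean = len(y), 2, 1

a, b = np.polyfit(x, y, deg=1)
ss_mean = np.sum((y - y.mean()) ** 2)
ss_fit = np.sum((y - (a * x + b)) ** 2)

# F = explained variation / unexplained variation, scaled by the degrees of freedom
F = ((ss_mean - ss_fit) / (p_fit - p_mean)) / (ss_fit / (n - p_fit))

# p-value: area under the F-distribution to the right of our F-value
p_value = f.sf(F, dfn=p_fit - p_mean, dfd=n - p_fit)
print(F, p_value)
```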

The p-value will be smaller when there are more samples relative to the number of parameters in the fit equation.

Multiple Regression

Multiple regression is the same idea in a higher-dimensional space: instead of fitting a line with one predictor, we fit a plane (or higher-dimensional surface) using several predictors.

Calculating $R^2$, $F$, and the $p-value$ is the same for both simple and multiple regression.

Calculating the F-value is exactly the same as before, only this time we replace the "mean" model with the simple regression and the "fit" with the multiple regression:

\[F=\frac{(SS(mean)-SS(fit))/(p_{fit}-p_{mean})}{SS(fit)/(n-p_{fit})}\\ \downarrow\\ F=\frac{(SS(simple)-SS(multiple))/(p_{multiple}-p_{simple})}{SS(multiple)/(n-p_{multiple})}\]
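
A sketch of that comparison (the tail-length numbers are made up, like the rest of the data), fitting both models with `np.linalg.lstsq` and asking whether the extra parameter is worth it:

```python
import numpy as np
from scipy.stats import f

weight = np.array([0.9, 1.6, 2.3, 3.5, 4.2, 5.1, 5.5, 5.6])
tail   = np.array([1.1, 1.3, 2.0, 2.2, 2.9, 3.1, 3.8, 4.0])   # hypothetical tail lengths
size   = np.array([1.4, 2.6, 1.0, 3.7, 5.5, 4.7, 6.4, 6.2])
n = len(size)

def ss_residuals(X, y):
    """Fit y = X @ beta by least squares and return the sum of squared residuals."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return np.sum((y - X @ beta) ** 2)

X_simple = np.column_stack([np.ones(n), weight])            # intercept + weight
X_multiple = np.column_stack([np.ones(n), weight, tail])    # intercept + weight + tail
p_simple, p_multiple = X_simple.shape[1], X_multiple.shape[1]

ss_simple = ss_residuals(X_simple, size)
ss_multiple = ss_residuals(X_multiple, size)

F = ((ss_simple - ss_multiple) / (p_multiple - p_simple)) / (ss_multiple / (n - p_multiple))
p_value = f.sf(F, dfn=p_multiple - p_simple, dfd=n - p_multiple)
print(F, p_value)
```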

t-tests and ANOVA

The goal of a t-test is to compare means and see if they are significantly different from each other. If the same method can calculate p-values for a linear regression and a t-test, then we can easily calculate p-values for more complicated situations.

How to apply the t-test:

  1. Ignore the x-axis and find the overall mean.
  2. Calculate $SS(mean)$, the sum of squared residuals around the mean.
  3. Fit a line for the data.
  4. Calculate $SS(fit)$, the sum of squares of the residuals around the fitted line(s).

【Step 3】The design matrix can be combined with an abstract version of the equation to represent a "fit" to the data. In practice, the role of each column is assumed, and the equation is written out like this:

\[y = mean_{control} + mean_{mutant}\]

【Step 4】

\[F=\frac{(SS(mean)-SS(fit))/(p_{fit}-p_{mean})}{SS(fit)/(n-p_{fit})}\]
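
A sketch of those steps, treating the t-test as a linear model (the control/mutant measurements are made up; the design matrix of 1's and 0's is the one described below):

```python
import numpy as np
from scipy.stats import f

# Hypothetical measurements: 4 control mice, 4 mutant mice
control = np.array([2.1, 2.5, 2.3, 2.7])
mutant = np.array([3.4, 3.9, 3.6, 4.1])
y = np.concatenate([control, mutant])
n, p_fit, p_mean = len(y), 2, 1

# Design matrix: column 1 is all 1's (mean_control),
# column 2 is 1 only for mutant rows (difference_(mutant-control))
X = np.column_stack([np.ones(n), np.r_[np.zeros(4), np.ones(4)]])

beta, *_ = np.linalg.lstsq(X, y, rcond=None)   # [mean_control, difference]
ss_mean = np.sum((y - y.mean()) ** 2)
ss_fit = np.sum((y - X @ beta) ** 2)

F = ((ss_mean - ss_fit) / (p_fit - p_mean)) / (ss_fit / (n - p_fit))
p_value = f.sf(F, dfn=p_fit - p_mean, dfd=n - p_fit)
print(beta, F, p_value)
```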

Design Matrices are not always square matrices.

Design Matrices

Always, "1" turns A (like $mean_{control}$) "on" for this data point, and "0" turns B (like $difference_{(mutant-control)}$) "off" for this data point.

$p_{fit}$ is the number of parameters in the fitted line.

【$Formula 1$】

\[y\ =\ mean_{control}\ +\ mean_{mutant}\]

【$Formula\ 2$】

\[y\ =\ mean_{control}\ +\ difference_{(mutant-control)}\]

Both $Formula\ 1$ and $Formula\ 2$ do the same thing and result in the same p-value, so why is $Formula\ 2$ more common? I don't know the answer for sure, but I think it has something to do with regression.

If the matrix is

\[\begin{bmatrix} 1& 0\\ 1& 0\\ 1& 0\\ 1& 0\\ 1& 1\\ 1& 1\\ 1& 1\\ 1&1 \end{bmatrix}\]

For the first row, we can see that the corresponding equation is

\[y\ =\ 1\ \times mean_{control}\ +\ 0\ \times difference_{(mutant-control)}\]

Remember that the numbers in the first column are multiplied by $mean_{control}$, and the numbers in the second column are multiplied by $difference_{(mutant-control)}$.

Multiplying $mean_{control}$ by $1$ "turns it on" by just letting it be.

Multiplying $difference_{(mutant-control)}$ by $0$ makes it $0$ and that "turns it off".
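
A tiny sketch of that on/off behavior, plugging hypothetical parameter values into the matrix above:

```python
import numpy as np

# The design matrix from above: column 1 -> mean_control, column 2 -> difference
X = np.array([[1, 0], [1, 0], [1, 0], [1, 0],
              [1, 1], [1, 1], [1, 1], [1, 1]])

mean_control, difference = 2.4, 1.35   # hypothetical fitted values

# Each row's prediction: 1 * mean_control + (0 or 1) * difference
predictions = X @ np.array([mean_control, difference])
print(predictions)   # first four rows: 2.4, last four rows: 2.4 + 1.35
```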

A design matrix full of 1's and 0's is perfect for doing t-tests or ANOVAs - any time we have different categories of data - but we can also use other numbers. For example, here's a design matrix for linear regression:

\[\begin{bmatrix} 1&0.9 \\ 1&1.6 \\ 1&2.3 \\ 1&3.5 \\ 1&4.2 \\ 1&5.1 \\ 1&5.5 \\ 1&5.6 \end{bmatrix}\]

However, the design matrix is interpreted the same way whether its elements are 1's and 0's or real numbers.

This style of design matrix (with 1's all the way down the first column) is the more common one.
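
The same least-squares machinery applied to the regression design matrix above (its second column holds the mouse weights; the sizes are made up) recovers the intercept and slope:

```python
import numpy as np

weights = np.array([0.9, 1.6, 2.3, 3.5, 4.2, 5.1, 5.5, 5.6])
sizes = np.array([1.4, 2.6, 1.0, 3.7, 5.5, 4.7, 6.4, 6.2])   # hypothetical

# Design matrix: 1's for the intercept, real-valued weights for the slope
X = np.column_stack([np.ones(len(weights)), weights])

beta, *_ = np.linalg.lstsq(X, sizes, rcond=None)
print(beta)   # [intercept, slope] -- the same line least squares found earlier
```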