StatQuest(ML) Machine Learning Fundamentals

Published: May 14, 2021 by Authur Lee

Categories: MachineLearning

A Gentle Introduction to Machine Learning

In general, Machine Learning is all about making predictions and classifications.

First of all, in Machine Learning Lingo, the original data is called Training Data.

We can fit different shapes to the Training Data, such as a straight line or a squiggle.

We use the testing data to evaluate Machine Learning methods.
Don't be fooled by how well a Machine Learning method fits the Training Data.

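Fitting the Training Data well but making poor predictions on the Testing Data is called overfitting. Finding the balance between a method that is too simple to capture the relationship and one that fits the Training Data too closely is called the Bias-Variance Tradeoff.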

There are tons of fancy-sounding Machine Learning methods, and each year something new and exciting comes on the scene, but regardless of what you use, the most important thing isn't how fancy it is, but how it performs with Testing Data.

Machine Learning Fundamentals: Cross Validation

Cross validation allows us to compare different machine learning methods and get a sense of how well they will work in practice.

Reusing the same data for both training and testing is a bad idea because we need to know how the method will work on data it wasn't trained on.

Cross validation divides the data into blocks, then trains on all but one block and tests on the block that was left out. Rather than worry too much about which block would be best for testing, cross validation uses them all, one at a time, and summarizes the results at the end. In the end, every block of data is used for testing and we can compare methods by seeing how well they performed.

If we divided the data into 4 blocks, then it is called Four-Fold Cross Validation. However, the number of blocks is arbitrary.

In an extreme case, we could call each individual patient (or sample) a block. This is called "Leave-One-Out Cross Validation".

That said, in practice, it is very common to divide the data into 10 blocks. This is called "Ten-Fold Cross Validation".
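The block-by-block procedure can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the data is made up, and the two methods being compared (`fit_mean` and `fit_line`) are hypothetical stand-ins, not methods named in these notes.

```python
def k_fold_blocks(n_samples, k):
    """Deal sample indices into k blocks of nearly equal size."""
    blocks = [[] for _ in range(k)]
    for i in range(n_samples):
        blocks[i % k].append(i)
    return blocks


def fit_mean(xs, ys):
    """Hypothetical method 1: always predict the average training y."""
    mean_y = sum(ys) / len(ys)
    return lambda x: mean_y


def fit_line(xs, ys):
    """Hypothetical method 2: least-squares straight line."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return lambda x: mean_y + slope * (x - mean_x)


def cross_validate(xs, ys, k, fit):
    """Test on each block once, train on the rest, average the squared error."""
    total = 0.0
    for block in k_fold_blocks(len(xs), k):
        held_out = set(block)
        train_x = [x for i, x in enumerate(xs) if i not in held_out]
        train_y = [y for i, y in enumerate(ys) if i not in held_out]
        model = fit(train_x, train_y)
        total += sum((ys[i] - model(xs[i])) ** 2 for i in block)
    return total / len(xs)


# Made-up data that follows a straight line exactly.
xs = [float(i) for i in range(8)]
ys = [2.0 * x + 1.0 for x in xs]

four_fold_mean = cross_validate(xs, ys, 4, fit_mean)
four_fold_line = cross_validate(xs, ys, 4, fit_line)
leave_one_out_line = cross_validate(xs, ys, len(xs), fit_line)  # one block per sample
```

Setting `k` to the number of samples gives Leave-One-Out Cross Validation, and `k = 10` gives Ten-Fold Cross Validation. The method with the lower averaged error is the one that did better on data it was not trained on.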

Say we wanted to use a method that involved a "tuning parameter", a parameter that isn't estimated, but just sort of guessed. Then we can use 10-fold cross validation to help find the best value for that tuning parameter.
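As a rough sketch of that idea, the code below tries several values of a tuning parameter and keeps the one with the lowest 10-fold cross validation error. The method (averaging the nearest training points), the parameter `n_neighbors`, the candidate values and the data are all assumptions chosen for illustration.

```python
def knn_predict(train_x, train_y, x, n_neighbors):
    """Predict y as the average of the n_neighbors closest training points."""
    nearest = sorted(zip(train_x, train_y), key=lambda p: abs(p[0] - x))[:n_neighbors]
    return sum(y for _, y in nearest) / len(nearest)


def cv_error(xs, ys, n_neighbors, k=10):
    """Mean squared error from k-fold cross validation."""
    total = 0.0
    for fold in range(k):
        # Samples whose index falls in this fold are held out for testing.
        train_x = [x for i, x in enumerate(xs) if i % k != fold]
        train_y = [y for i, y in enumerate(ys) if i % k != fold]
        for i in range(len(xs)):
            if i % k == fold:
                total += (ys[i] - knn_predict(train_x, train_y, xs[i], n_neighbors)) ** 2
    return total / len(xs)


# Made-up data: an upward trend plus an alternating wiggle.
xs = [float(i) for i in range(30)]
ys = [0.5 * x + (1.0 if i % 2 == 0 else -1.0) for i, x in enumerate(xs)]

candidates = [1, 2, 4, 8]  # hypothetical values of the tuning parameter
scores = {c: cv_error(xs, ys, c) for c in candidates}
best = min(scores, key=scores.get)
```

`best` holds the candidate value with the smallest cross validation error, which is the value we would then use for the final model.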

The Confusion Matrix

	Actual 1	Actual 0
Predicted 1	True Positives	False Positives
Predicted 0	False Negatives	True Negatives
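The four cells can be counted directly from paired actual and predicted labels. The sketch below uses made-up labels, and the function name `confusion_counts` is hypothetical.

```python
def confusion_counts(actual, predicted):
    """Count the four cells of a 2x2 confusion matrix for 0/1 labels."""
    counts = {"TP": 0, "FP": 0, "FN": 0, "TN": 0}
    for a, p in zip(actual, predicted):
        if p == 1 and a == 1:
            counts["TP"] += 1  # predicted 1, actually 1
        elif p == 1 and a == 0:
            counts["FP"] += 1  # predicted 1, actually 0
        elif p == 0 and a == 1:
            counts["FN"] += 1  # predicted 0, actually 1
        else:
            counts["TN"] += 1  # predicted 0, actually 0
    return counts


# Made-up labels for eight samples.
actual = [1, 1, 1, 0, 0, 0, 1, 0]
predicted = [1, 0, 1, 0, 1, 0, 1, 0]
counts = confusion_counts(actual, predicted)
```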

More sophisticated metrics that build on the Confusion Matrix, such as Sensitivity, Specificity, ROC and AUC, can help us decide which method to use.

More complex Confusion Matrix:

	value A	value B	value C
value A	True	false	false
value B	false	True	false
value C	false	false	True

Ultimately, the size of the confusion matrix is determined by the number of things we want to predict.

In summary, a Confusion Matrix tells you what your machine learning algorithm did right and what it did wrong.
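For more than two categories the same counting idea applies, with one row and one column per category. Here is a minimal sketch, assuming rows hold the predicted category and columns hold the actual category (as in the 2x2 table above); the labels and data are made up.

```python
def confusion_matrix(actual, predicted, labels):
    """Build a square matrix: rows are predicted labels, columns are actual labels."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for a, p in zip(actual, predicted):
        matrix[index[p]][index[a]] += 1
    return matrix


labels = ["value A", "value B", "value C"]
actual = ["value A", "value A", "value B", "value C", "value C", "value B"]
predicted = ["value A", "value B", "value B", "value C", "value A", "value B"]

matrix = confusion_matrix(actual, predicted, labels)
# Correct predictions sit on the diagonal; everything else is a mistake.
correct = sum(matrix[i][i] for i in range(len(labels)))
```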

Sensitivity and Specificity

2×2 Matrix

	Actual: Has Heart Disease	Actual: Does Not Have Heart Disease
Predicted: Has Heart Disease	True Positives	False Positives
Predicted: Does Not Have Heart Disease	False Negatives	True Negatives

In this case, Sensitivity tells us what percentage of patients with heart disease were correctly identified.

\[Sensitivity = \frac{True\ Positives}{True\ Positives\ +\ False\ Negatives}\]

Specificity tells us what percentage of patients without heart disease were correctly identified.

\[Specificity = \frac{True\ Negatives}{True\ Negatives\ +\ False\ Positives}\]
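Both formulas translate directly into code. The counts below are hypothetical, chosen only to show the arithmetic; they are not the counts behind any results in these notes.

```python
def sensitivity(true_positives, false_negatives):
    """Fraction of actual positives that were correctly identified."""
    return true_positives / (true_positives + false_negatives)


def specificity(true_negatives, false_positives):
    """Fraction of actual negatives that were correctly identified."""
    return true_negatives / (true_negatives + false_positives)


# Hypothetical counts for a heart disease classifier.
tp, fp, fn, tn = 80, 15, 20, 85

sens = sensitivity(tp, fn)  # 80 / (80 + 20)
spec = specificity(tn, fp)  # 85 / (85 + 15)
```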

For example, suppose we calculate the Sensitivity and Specificity for a real problem, get the results below, and then analyze them.

	Sensitivity	Specificity
Logistic Regression	0.81	0.85
Random Forest	0.83	0.83

Sensitivity tells us that the Random Forest is slightly better at correctly identifying positives, which, in this case, are patients with heart disease.
Specificity tells us that Logistic Regression is slightly better at correctly identifying negatives, which, in this case, are patients without heart disease.
We would choose the Logistic Regression model if correctly identifying patients without heart disease was more important than correctly identifying patients with heart disease.
Alternatively, we would choose the Random Forest model if …

3×3 Matrix

The big difference when calculating Sensitivity and Specificity for larger confusion matrices is that there are no single values that work for the entire matrix. Instead, we calculate a different Sensitivity and Specificity for each category, like:

Sensitivity	Specificity
$Sensitivity_{Troll\ 2}$	$Specificity_{Troll\ 2}$
$Sensitivity_{Gore\ Police}$	$Specificity_{Gore\ Police}$
$Sensitivity_{Cool\ as\ Ice}$	$Specificity_{Cool\ as\ Ice}$

Read the Confusion Matrix by column: each column holds the actual category, and each row holds the predicted category.

Example:

	Troll 2	Gore Police	Cool as Ice
Troll 2	12	102	93
Gore Police	112	23	77
Cool as Ice	83	92	17

\[Sensitivity_{Troll\ 2} = \frac{True\ Positives_{Troll\ 2}}{True\ Positives_{Troll\ 2}\ +\ False\ Negatives_{Troll\ 2}} \\ then,\ Sensitivity_{Troll\ 2} = \frac{12}{12+(112+83)} \approx 0.06\]

$Sensitivity_{Troll\ 2}$ tells us that only 6% of the people that loved the movie Troll 2 more than Gore Police or Cool as Ice were correctly identified.

\[Specificity_{Troll\ 2} = \frac{True\ Negatives_{Troll\ 2}}{True\ Negatives_{Troll\ 2}\ +\ False\ Positives_{Troll\ 2}} \\ then,\ Specificity_{Troll\ 2} = \frac{(23+77+92+17)}{(23+77+92+17)+(102+93)} \approx 0.52\]

$Specificity_{Troll\ 2}$ tells us that 52% of the people who loved Gore Police or Cool as Ice more than Troll 2 were correctly identified.
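The same per-category calculation can be written once and applied to every category. This sketch uses the 3x3 matrix from the example, with rows as predictions and columns as actual values; the function name `per_class_rates` is hypothetical.

```python
labels = ["Troll 2", "Gore Police", "Cool as Ice"]
# Rows are predicted categories, columns are actual categories.
matrix = [
    [12, 102, 93],
    [112, 23, 77],
    [83, 92, 17],
]


def per_class_rates(matrix, k):
    """Return (sensitivity, specificity) for category k."""
    total = sum(sum(row) for row in matrix)
    tp = matrix[k][k]
    fn = sum(row[k] for row in matrix) - tp  # rest of column k
    fp = sum(matrix[k]) - tp                 # rest of row k
    tn = total - tp - fn - fp                # everything outside row k and column k
    return tp / (tp + fn), tn / (tn + fp)


rates = {label: per_class_rates(matrix, k) for k, label in enumerate(labels)}
troll_sensitivity, troll_specificity = rates["Troll 2"]
```

Rounded to two decimals, the Troll 2 values match the worked example: 0.06 and 0.52.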


Bias and Variance

Linear Regression (aka "Least Squares") fits a Straight Line to the training set.

The inability of a machine learning method (like linear regression) to capture the true relationship is called bias. Because the straight line can't be curved like the "true" relationship, it has a relatively large amount of bias.

Sum of Squares: we measure the distances from the fitted line to the data, square them, and add them up.

psst. they are squared so that negative distances do not cancel out the positive distances.
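A tiny example shows why the squaring matters. The data and the fitted line (`y = x`) are made up so that the distances alternate in sign.

```python
def residuals(predict, xs, ys):
    """Signed distances from the fitted line to each data point."""
    return [y - predict(x) for x, y in zip(xs, ys)]


def sum_of_squares(predict, xs, ys):
    """Square each distance, then add them up."""
    return sum(r ** 2 for r in residuals(predict, xs, ys))


# Made-up points that sit alternately above and below the line y = x.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 1.0, 4.0, 3.0]
line = lambda x: x

raw_sum = sum(residuals(line, xs, ys))  # +1 -1 +1 -1 cancels to 0
squared_sum = sum_of_squares(line, xs, ys)  # 1 + 1 + 1 + 1 = 4
```

The raw sum suggests a perfect fit even though every point misses the line, while the sum of squares reports the misses.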

In Machine Learning lingo, the difference in fits between data sets is called Variance.

Overfit: if the Squiggly Line fits the training set really well, but not the testing set, we say that the Squiggly Line is overfit.
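To make the line versus squiggle comparison concrete, here is a small sketch with made-up data. The "squiggle" is an assumption: it simply predicts the y of the nearest training point, so it passes through every training point the way a very flexible curve would.

```python
def make_line(xs, ys):
    """Least-squares straight line."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return lambda x: mean_y + slope * (x - mean_x)


def make_squiggle(xs, ys):
    """Stand-in for a squiggly line: copy the y of the nearest training x."""
    points = list(zip(xs, ys))
    return lambda x: min(points, key=lambda p: abs(p[0] - x))[1]


def sum_of_squares(model, xs, ys):
    return sum((y - model(x)) ** 2 for x, y in zip(xs, ys))


# Made-up training and testing sets with a roughly linear relationship.
train_x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
train_y = [1.2, 1.9, 3.4, 3.7, 5.3, 5.8]
test_x = [1.5, 2.5, 3.5, 4.5, 5.5]
test_y = [1.5, 2.5, 3.5, 4.5, 5.5]

line = make_line(train_x, train_y)
squiggle = make_squiggle(train_x, train_y)

fits = {
    "line": (sum_of_squares(line, train_x, train_y), sum_of_squares(line, test_x, test_y)),
    "squiggle": (sum_of_squares(squiggle, train_x, train_y), sum_of_squares(squiggle, test_x, test_y)),
}
```

In this toy example the squiggle has a training sum of squares of zero but a much larger testing sum of squares than the line. That large gap between the two data sets is the high variance described above, and it is why the squiggle counts as overfit.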

Three commonly used methods for finding the sweet spot between simple and complicated models are: regularization, boosting and bagging. (The Random Forest shows an example of bagging in action.)
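As one illustration of regularization, the sketch below adds a ridge-style penalty to a straight-line fit. The choice of a ridge penalty, the penalty values and the data are all assumptions, since these notes do not say which form of regularization is meant. For a single predictor, minimizing the sum of squares plus `penalty * slope**2` gives the closed-form slope used here.

```python
def penalized_slope(xs, ys, penalty):
    """Slope that minimizes the sum of squares plus penalty * slope**2."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # penalty = 0 gives the ordinary least-squares slope.
    return sxy / (sxx + penalty)


# Made-up training data.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [1.2, 1.9, 3.4, 3.7, 5.3, 5.8]

plain_slope = penalized_slope(xs, ys, 0.0)
shrunk_slope = penalized_slope(xs, ys, 10.0)  # hypothetical penalty value
```

The penalty shrinks the slope toward zero, which makes the fit a little worse on the training set in exchange for being less sensitive to it. The penalty is itself a tuning parameter, so cross validation is a natural way to choose it.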