R(23), Regression
Linear Regression
The general mathematical equation of linear regression is y=ax+b
.
y
: response variablex
: predictive variablea
&b
: coefficient constants
We use lm()
function to build the relation model between response variable and predictive variable.
lm(formula, data)
formula
: an object of class "formula" (or one that can be coerced to that class): a symbolic description of the model to be fitted. The details of model specification are given under ‘Details’.data
: an optional data frame, list or environment (or object coercible by as.data.frame to a data frame) containing the variables in the model. If not found in data, the variables are taken from environment(formula), typically the environment from which lm is called.
x <- c(151, 174, 138, 186, 128, 136, 179, 163, 152, 131)
y <- c(63, 81, 56, 91, 47, 57, 76, 72, 62, 48)
# Apply the lm() function.
relation <- lm(y~x)
# print the relation
print(relation)
result:
Call:
lm(formula = y ~ x)
Coefficients:
(Intercept) x
-38.4551 0.6746
we can also get relevant summaries.
print(summary(relation))
then we will get:
Call:
lm(formula = y ~ x)
Residuals:
Min 1Q Median 3Q Max
-6.3002 -1.6629 0.0412 1.8944 3.9775
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -38.45509 8.04901 -4.778 0.00139 **
x 0.67461 0.05191 12.997 1.16e-06 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 3.253 on 8 degrees of freedom
Multiple R-squared: 0.9548, Adjusted R-squared: 0.9491
F-statistic: 168.9 on 1 and 8 DF, p-value: 1.164e-06
predict()
is a generic function for predictions from the results of various model fitting functions. The function invokes particular methods which depend on the class of the first argument.
predict(object, ...)
object
: a model object for which prediction is desired. (the formula we built usinglm()
before)...
: additional arguments affecting the predictions produced.
Multiple Regression
Multiple regression is the extension of linear regression to the relationship between two or more variables. In a simple linear relationship, we have a predictive variable and a response variable, but in multiple regression, we have multiple predictive variables and a response variable.
Multiple regression:
y = a + b1x1 + b2x2 + ... + bnxn
We still use lm()
function to build the relational model between prediction variables and response variable.
lm(y ~ x1+x2+x3...,data)
example:
get a subset from dataset named "mtcars".
input <- mtcars[,c("mpg","disp","hp","wt")]
print(head(input))
result:
mpg disp hp wt
Mazda RX4 21.0 160 110 2.620
Mazda RX4 Wag 21.0 160 110 2.875
Datsun 710 22.8 108 93 2.320
Hornet 4 Drive 21.4 258 110 3.215
Hornet Sportabout 18.7 360 175 3.440
Valiant 18.1 225 105 3.460
create the model and relationship model.
input <- mtcars[,c("mpg","disp","hp","wt")]
# Create the relationship model.
model <- lm(mpg~disp+hp+wt, data = input)
# Show the model.
print(model)
# Get the Intercept and coefficients as vector elements.
cat("# # # # The Coefficient Values # # # ","
")
a <- coef(model)[1]
print(a)
Xdisp <- coef(model)[2]
Xhp <- coef(model)[3]
Xwt <- coef(model)[4]
print(Xdisp)
print(Xhp)
print(Xwt)
then we get the prediction model and then we can use is to predict the new x
.
From process before we can get the formula:
Y = a+Xdisp.x1+Xhp.x2+Xwt.x3
then we put the prediction variable then we can get the response variable.
Logistic Regression
Logistic regression is a regression model in which the response variable (dependent variable) has a categorical value such as true / false or 0 / 1. It actually measures the probability of binary response as the value of the response variable based on the mathematical equation related to the prediction variable.
y = 1/(1+e^(-(a+b1x1+b2x2+b3x3+...)))
For logistic regression, glm()
function is needed.
glm(formula, family, data)
glm()
is used to fit generalized linear models, specified by giving a symbolic description of the linear predictor and a description of the error distribution.
formula
: an object of class "formula" (or one that can be coerced to that class): a symbolic description of the model to be fitted. The details of model specification are given under ‘Details’.family
: a description of the error distribution and link function to be used in the model. For glm this can be a character string naming a family function, a family function or the result of a call to a family function. For glm.fit only the third option is supported. (See family for details of family functions.)data
: an optional data frame, list or environment (or object coercible by as.data.frame to a data frame) containing the variables in the model. If not found in data, the variables are taken from environment(formula), typically the environment from whichglm()
is called.
Example:
# select some columns from mtcars.
input <- mtcars[,c("am","cyl","hp","wt")]
# build the model
am.data = glm(formula = am ~ cyl + hp + wt, data = input, family = binomial)
# show the summary to analysis
print(summary(am.data))
Poisson Regression
The response variable is in the form of count rather than fraction. Also, the response variable obey the Poisson distribution.
log(y) = a + b1x1 + b2x2 + bnxn...
Also, we use glm()
to implement the model building, like Logistic Regression.
input <- warpbreaks
output <-glm(formula = breaks ~ wool+tension, data = warpbreaks, family = poisson)
print(summary(output))