R(23), Regression


Linear Regression

The general mathematical equation of linear regression is y = ax + b.

  • y: response variable
  • x: predictor variable
  • a and b: constant coefficients (the slope and the intercept, respectively)

We use the lm() function to build the relationship model between the response variable and the predictor variable.

lm(formula, data)
  • formula: an object of class "formula" (or one that can be coerced to that class): a symbolic description of the model to be fitted.
  • data: an optional data frame, list or environment (or object coercible by as.data.frame to a data frame) containing the variables in the model. If not found in data, the variables are taken from environment(formula), typically the environment from which lm() is called.
x <- c(151, 174, 138, 186, 128, 136, 179, 163, 152, 131)
y <- c(63, 81, 56, 91, 47, 57, 76, 72, 62, 48)

# Apply the lm() function.
relation <- lm(y~x)

# print the relation
print(relation)

Result:

Call:
lm(formula = y ~ x)

Coefficients:
(Intercept)            x  
   -38.4551          0.6746 

We can also get a more detailed summary of the fitted model.

print(summary(relation))

Then we will get:

Call:
lm(formula = y ~ x)

Residuals:
    Min      1Q  Median      3Q     Max 
-6.3002 -1.6629  0.0412  1.8944  3.9775 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) -38.45509    8.04901  -4.778  0.00139 ** 
x             0.67461    0.05191  12.997 1.16e-06 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 3.253 on 8 degrees of freedom
Multiple R-squared:  0.9548,    Adjusted R-squared:  0.9491 
F-statistic: 168.9 on 1 and 8 DF,  p-value: 1.164e-06

predict() is a generic function for predictions from the results of various model fitting functions. The function invokes particular methods which depend on the class of the first argument.

predict(object, ...)
  • object: a model object for which prediction is desired (here, the model we fitted with lm() above).
  • ...: additional arguments affecting the predictions produced.
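
For example, continuing with the relation model fitted above, we can predict y for a new value of x. A minimal sketch (the value x = 170 and the name new.x are arbitrary, chosen only for illustration):

# Find the predicted value of y for a new x.
new.x <- data.frame(x = 170)
print(predict(relation, new.x))

Given the coefficients above, this corresponds to -38.4551 + 0.6746 * 170 ≈ 76.2.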

Multiple Regression

Multiple regression extends simple linear regression to the relationship between more than two variables. In simple linear regression we have one predictor variable and one response variable, while in multiple regression we have several predictor variables and one response variable.

Multiple regression:

y = a + b1x1 + b2x2 + ... + bnxn

We still use the lm() function to build the relationship model between the predictor variables and the response variable.

lm(y ~ x1 + x2 + x3 + ..., data)

Example:

Get a subset of the built-in dataset "mtcars".

input <- mtcars[,c("mpg","disp","hp","wt")]
print(head(input))

Result:

                   mpg disp  hp    wt
Mazda RX4         21.0  160 110 2.620
Mazda RX4 Wag     21.0  160 110 2.875
Datsun 710        22.8  108  93 2.320
Hornet 4 Drive    21.4  258 110 3.215
Hornet Sportabout 18.7  360 175 3.440
Valiant           18.1  225 105 3.460

Create the relationship model and extract its coefficients.

input <- mtcars[,c("mpg","disp","hp","wt")]

# Create the relationship model.
model <- lm(mpg~disp+hp+wt, data = input)

# Show the model.
print(model)

# Get the Intercept and coefficients as vector elements.
cat("# # # # The Coefficient Values # # # ","
")

a <- coef(model)[1]
print(a)

Xdisp <- coef(model)[2]
Xhp <- coef(model)[3]
Xwt <- coef(model)[4]

print(Xdisp)
print(Xhp)
print(Xwt)

Then we get the prediction model, which we can use to predict the response for new values of the predictors.

From the process above we can get the formula:

Y = a + Xdisp*x1 + Xhp*x2 + Xwt*x3

By plugging values of the predictor variables into this formula, we get the predicted response variable.
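
Instead of applying the formula by hand, we can also pass new predictor values to predict(). A minimal sketch (the name new.data and the disp/hp/wt values are arbitrary, chosen only for illustration):

# Predict mpg for a new set of predictor values.
new.data <- data.frame(disp = 221, hp = 102, wt = 2.91)
print(predict(model, new.data))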

Logistic Regression

Logistic regression is a regression model in which the response variable (dependent variable) takes a categorical value such as true/false or 0/1. It models the probability of a binary response as a function of the predictor variables through the mathematical equation:

y = 1/(1+e^(-(a+b1x1+b2x2+b3x3+...)))

For logistic regression, the glm() function is used.

glm(formula, family, data)

glm() is used to fit generalized linear models, specified by giving a symbolic description of the linear predictor and a description of the error distribution.

  • formula: an object of class "formula" (or one that can be coerced to that class): a symbolic description of the model to be fitted.
  • family: a description of the error distribution and link function to be used in the model. This can be a character string naming a family function, a family function, or the result of a call to a family function; for logistic regression we use binomial.
  • data: an optional data frame, list or environment (or object coercible by as.data.frame to a data frame) containing the variables in the model. If not found in data, the variables are taken from environment(formula), typically the environment from which glm() is called.

Example:

# Select some columns from mtcars.
input <- mtcars[,c("am","cyl","hp","wt")]

# Build the model; family = binomial specifies logistic regression.
am.data <- glm(formula = am ~ cyl + hp + wt, data = input, family = binomial)

# Show the summary for analysis.
print(summary(am.data))
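
The fitted model returns predicted probabilities via predict() with type = "response". A minimal sketch (the name new.car and the cyl/hp/wt values are arbitrary, chosen only for illustration):

# Predicted probability of a manual transmission (am = 1) for a hypothetical car.
new.car <- data.frame(cyl = 4, hp = 95, wt = 2.5)
print(predict(am.data, new.car, type = "response"))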

Poisson Regression

In Poisson regression, the response variable is a count rather than a fraction, and it is assumed to follow a Poisson distribution.

log(y) = a + b1x1 + b2x2 + ... + bnxn

As with logistic regression, we use glm() to build the model, this time with family = poisson.

input <- warpbreaks

# Model the count of warp breaks as a function of wool type and tension.
output <- glm(formula = breaks ~ wool + tension, data = input, family = poisson)
print(summary(output))
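
Since the model is linear on the log scale, exponentiating the coefficients puts them back on the count scale, where they act as multiplicative effects (exp() and coef() are base R functions; this interpretation step is a common convention, not part of the model fit itself):

# Exponentiated coefficients: multiplicative effects on the expected number of breaks.
print(exp(coef(output)))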