R(25), Analysis

Published: by Creative Commons Licence


Covariance Analysis

We use regression analysis to create a model to describe the impact the impact of variables in the prediction variables on the response variables. Sometimes, if we have a category variable, such as yes/no or male/female, etc. Simple regression analysis provides multiple results for each value of a categorical variable. In this case, we can study the effect of categorical variables by using categorical variables together with predictive variables and comparing the regression lines of each level of categorical variables. Such an analysis is called covariance analysis, also known as ANCOVA.

get the dataset:

input <- mtcars[,c("am","mpg","hp")]

(I) There is an interaction among categorical variables and predictive variables in the model:

# create the regression model:
result <- aov(mpg~hp*am,data = input)

(II) There is no interaction between categorical variables and predictive variables.

result <- aov(mpg~hp+am,data = input)

We can use anova() to compare two models, to compare whether the interaction of two variables is really important.

# Get the dataset.
input <- mtcars

# Create the regression models.
result1 <- aov(mpg~hp*am,data = input)
result2 <- aov(mpg~hp+am,data = input)

# Compare the two models.
print(anova(result1,result2))

Time Series Analysis

The main purpose of time series analysis is to predict the future according to the existing data.

A stable time series often contains two parts: regular time series and noise.

Therefore, in the following methods, the main purpose is to filter the noise value, and make our time series more meaningful.

timeseries.object.name <-  ts(data, start, end, frequency)
  • data: a vector or matrix of the observed time-series values. A data frame will be coerced to a numeric matrix via data.matrix. (See also ‘Details’.)
  • start: the time of the first observation. Either a single number or a vector of two numbers (the second of which is an integer), which specify a natural time unit and a (1-based) number of samples into the time unit. See the examples for the use of the second form.
  • end: the time of the last observation, specified in the same way as start.
  • frequency: the number of observations per unit of time.

(except data, other parameters are optimal.)

Preprocessing of Time Series:

  1. stationarity test
  2. white noise test
  3. parameter characteristics of stationary time series

Steps of Model Building:

  1. Get the time series data set to be analyzed.
  2. Plot the data and observe its stationarity. If it is a non-stationary time series, it is necessary to perform d-order difference operation first and then transform it into a stationary time series, where D is d in ARIMA (P, D, q) model; If it is a stationary sequence, ARMA (P, q) model is used. So the difference between ARIMA (P, D, q) model and ARMA (P, q) model is that the characteristic polynomial of the autoregressive part of the former contains D unit roots.
  3. The autocorrelation coefficient ACF and partial autocorrelation coefficient PACF of the obtained stationary time series are obtained respectively. Through the analysis of autocorrelation graph and partial autocorrelation graph, the optimal level P and order q are obtained. From D, Q, P, ARIMA model is obtained.
  4. Model diagnosis. Diagnostic analysis was carried out to confirm that the obtained model is indeed consistent with the observed data characteristics. If not, go back to step (3).
# Get the data points in form of a R vector.
rainfall <- c(799,1174.8,865.1,1334.6,635.4,918.5,685.5,998.6,784.2,985,882.8,1071)
# Convert it to a time series object.
rainfall.timeseries <- ts(rainfall,start = c(2012,1),frequency = 12)
# Print the timeseries data.
print(rainfall.timeseries)
# Give the chart file a name.
png(file = "rainfall.png")
# Plot a graph of the time series.
plot(rainfall.timeseries)
# Save the file.
dev.off()

the frequency in ts() function determine the time interval of measuring data points.

  • frequency = 12: specify the data points for each month of the year.
  • frequency = 4: specify the data points for each season of the year.
  • frequency = 6: specify the data points for each 10 minutes of an hour.
  • frequency = 24*6: specify the data points for each 10 minutes of a day.

Multiple Time Series: We can plot multiple time series in a chart by combining two series into a matrix.

# Get the data points in form of a R vector.
rainfall1 <- c(799,1174.8,865.1,1334.6,635.4,918.5,685.5,998.6,784.2,985,882.8,1071)
rainfall2 <- c(655,1306.9,1323.4,1172.2,562.2,824,822.4,1265.5,799.6,1105.6,1106.7,1337.8)
# Convert them to a matrix.
combined.rainfall <-  matrix(c(rainfall1,rainfall2),nrow = 12)
# Convert it to a time series object.
rainfall.timeseries <- ts(combined.rainfall,start = c(2012,1),frequency = 12)
# Print the timeseries data.
print(rainfall.timeseries)
# Give the chart file a name.
png(file = "rainfall_combined.png")
# Plot a graph of the time series.
plot(rainfall.timeseries, main = "Multiple Time Series")
# Save the file.
dev.off()