Time Series Analysis for Wine Quality

Problem Description:

The Time Series Analysis using R homework requires the analysis and cleaning of a time series dataset containing at least two predictor variables and one response variable (yt). The goal is to determine a model that best fits the data. The provided dataset can be downloaded from Kaggle:

Solution

Select a time series dataset which contains at least two predictor variables and one response variable (yt). We need to analyze and clean the data to ascertain a model which best fits the data.

##https://www.kaggle.com/datasets/yasserh/wine-quality-dataset/download?datasetVersionNumber=1

##the dataset can be downloaded and saved.

1. Exploratory Data Analysis

First check for completeness and consistency of the data (if there are NAs or missing observations, replace with the value of the previous observation; make a note of this)
Provide descriptive analyses of your variables. This should include the five-number summary, histograms with fitted distribution, correlation plots (or heatmap). All figures must include comments.

mydata<-read.csv("C:UsersUserOneDriveDesktop>winequality-red.csv",header =T,sep = ",")

mydata1<-(mydata[,1:3])

summary(mydata1)

fixed.acidity volatile.acidity citric.acid

Min. : 4.60 Min. :0.1200 Min. :0.000

1st Qu.: 7.10 1st Qu.:0.3900 1st Qu.:0.090

Median : 7.90 Median :0.5200 Median :0.260

Mean : 8.32 Mean :0.5278 Mean :0.271

3rd Qu.: 9.20 3rd Qu.:0.6400 3rd Qu.:0.420

Max. :15.90 Max. :1.5800 Max. :1.000

## Plotting a histogram

fixed.acidity <- mydata$fixed.acidity

hist(fixed.acidity)

## the data is normally distributed as show above

# plotting corr heatmap

# Load and install heatmaply package

install.packages("heatmaply")

library(heatmaply)

heatmaply_cor(x = cor(mydata1), xlab = "Features",

ylab = "Features", k_col = 2, k_row = 2)

## Colinearity between different features can be observed

2. Data pre-processing

(a) With tsdisplay or ggtsdisplay, for each variable, use its time series plot, ACF and PACF to comment on its stationarity (you can also decompose the time series). To supplement this, use the appropriate Dickey-Fuller (unit root) test, to determine whether or not it is stationary.

# Importing library

install.packages("tseries")

library(tseries)

## for first variable (volatile.acidity)

mydata2<-(mydata[,1:2])

mydata2

mydata2<-as.data.frame(mydata1)

wine.ts1<-ts(mydata2[,2], frequency=4)

#par(mfrow=c(3,1), mar=c(4,4,4,4)) ### Figure 4.5

plot(wine.ts1, type="l", xlab="volatile.acidity",ylab="fixed.acidity")

acf(mydata2[,2], 25, xlab="Lag", ylab="ACF", main="")

acf(mydata2[,2], 25, type="partial", xlab="Lag",ylab="Partial ACF", main="")

## for second variable (citric.acid)

mydata3<-(mydata[,3])

mydata3

mydata3<-as.data.frame(mydata1)

wine.ts2<-ts(mydata3[,2], frequency=4)

#par(mfrow=c(3,1), mar=c(4,4,4,4)) ### Figure 4.5

plot(wine.ts2, type="l", xlab="citric.acid",ylab="fixed.acidity")

acf(mydata3[,2], 25, xlab="Lag", ylab="ACF", main="")

# Conduct the augmented Dickey-Fuller test

data1=mydata$volatile.acidity

data2=mydata$citric.acid

adf.test(data1)

Augmented Dickey-Fuller Test

data: data2

Dickey-Fuller = -8.62, Lag order = 11, p-value = 0.01

alternative hypothesis: stationary

The test statistic and p-value come out to be equal to -8.62 and 0.01 respectively.

Since the p-value is less than 0.05 we would reject the null hypothesis. It implies that the time series is stationary.

adf.test(data2)

Augmented Dickey-Fuller Test

data: data2

Dickey-Fuller = -7.2507, Lag order = 11, p-value = 0.01

alternative hypothesis: stationary

## The test statistic and p-value come out to be equal to -7.2507and 0.01 respectively.

Since the p-value is less than than 0.05, we would reject the null hypothesis. It implies that the time series is stationary.

(b) If it is not stationary, determine the level of differencing to make our series stationary. We can use the ndiffs function which performs a unit-root test to determine this. After this, difference your data to ascertain a stationary time series. Re-do part c) for your differenced time series and comment on the time series plot, ACF and PACF. Recall that the time series models we’ve observed rely on stationarity.

#Difference the data

dwine1<-diff(mydata2[,2], lag=4)

#par(mfrow=c(3,1), mar=c(4,4,4,4)) ### Figure 4.6

plot(diff(wine.ts1, lag=4), type="l", xlab="volatile.acidity",ylab="Differenced Series")

acf(dwine1, 25, xlab="Lag", ylab="ACF", main="")

acf(dwine1, 25, type="partial", xlab="Lag",ylab="Partial ACF", main="")

## The series is stationary since the ACF cuts off at lag one and the PACF attenuates.

3. Feature Generation, Model Testing and Forecasting.

(a) Fit an AR(p) model to the data (using part (a), AIC or some built in R function)

#Fitting the model

dwine1.aic<-matrix(0,5,5)

for (i in 0:4) for (j in 0:4) {

dwine1.fit<-arima(dwine1, order=c(i,0,j))

dwine1.aic[i+1,j+1]<-dwine1.fit$aic

}

dwine1.fit<-arima(dwine1, order=c(3,0,4))

dwine1.fit

Call:

arima(x = dwine1, order = c(3, 0, 4))

Coefficients:

ar1 ar2 ar3 ma1 ma2 ma3 ma4

0.1165 0.0627 0.0377 0.0202 0.0216 0.0253 -0.9679

s.e. 0.0264 0.0262 0.0264 0.0085 0.0082 0.0085 0.0081

intercept

0e+00

s.e. 6e-04

install.packages("dLagM")

library(dLagM)

data(mydata)

model.ardlDlm1 = ardlDlm(formula =fixed.acidity ~ volatile.acidity + citric.acid,

data = data.frame(mydata) , p = 2 , q = 1 )

summary(model.ardlDlm1)

Time series regression with "ts" data:

Start = 3, End = 1599

Call:

dynlm(formula = as.formula(model.text), data = data)

Residuals:

Min 1Q Median 3Q Max

-5.6897 -0.7233 -0.0334 0.7063 6.6500

Coefficients:

Estimate Std. Error t value Pr(>|t|)

(Intercept) 3.72772 0.24723 15.078 < 2e-16 ***

volatile.acidity.t 1.44002 0.20717 6.951 5.28e-12 ***

volatile.acidity.1 -0.19413 0.21183 -0.916 0.3596

volatile.acidity.2 0.11671 0.20695 0.564 0.5729

citric.acid.t 6.43283 0.19860 32.392 < 2e-16 ***

citric.acid.1 -1.61361 0.25607 -6.301 3.81e-10 ***

citric.acid.2 0.45340 0.19801 2.290 0.0222 *

fixed.acidity.1 0.29360 0.02403 12.216 < 2e-16 ***

---

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.21 on 1589 degrees of freedom

Multiple R-squared: 0.5196, Adjusted R-squared: 0.5175

F-statistic: 245.5 on 7 and 1589 DF, p-value: < 2.2e-16

4. Provide a brief summary of your findings and state which model performs better.

## adjusted R2. is 0.5175 which means that at least half of the response variable is explained by the predictor variables.

This model has equally distributed residuals around zero. The errors of this model are within a small bandwidth.

This fitted model therefore performs better than the first two models.

5. Suggest any improvements which could’ve been made to the model based on your findings, and what was learnt in class.

From the finding we can see that an increase in volatile. acidity increase the fixed. acidity positively also Increase in citric.acid increase the fixed.acidity

Thus, both should be controlled at an average level.

Analyzing Time Series Data for Wine Quality Using the R Environment

Problem Description:

Solution