Problem Description:
In this R homework, we are tasked with analyzing data on a specific species of wild mustard. The data includes several variables related to the plant, such as leaf glucosinolate levels, isothiocyanate levels, leaf thickness, the number of trichomes, and the average leaf area removed by herbivores (LAR).
The goal of the homework is to determine the best predictive model for LAR based on the available data. To do this, we will follow a step-by-step approach involving data analysis and regression modeling.
Solution
Data.
We are given the data on a species of wild mustard with the following variables;
- Leaf glucosinolate (ug/mg dry weight) are chemicals that have been implicated in herbivore resistance
- Isothiocyanates (ug/mg dry weight) are chemicals that have been implicated in herbivore resistance.
- Leaf thickness (mm) is the average thickness of leaves based on the average of 10 randomly selected leaves.
- Trichomes (average number per square cm) are small leaf hairs.
- LAR is a measure of leaf area removed by an herbivore.
First we read data into R workspace.
## combining data into one data.frame
mustard.df = data.frame(leaf.glucosinolate = leaf.glucosinolate,
leaf.thickness = leaf.thickness,
isothiocyanates = isothiocyanates,
trichomes = trichomes,LAR = LAR)
##structure of data
str(mustard.df)
## 'data.frame': 100 obs. of 5 variables:
## $ leaf.glucosinolate: num 129 156 110 105 117 ...
## $ leaf.thickness : num 1.1 0.4 0.8 0.5 0.6 1.2 0.9 0.5 1 0.7 ...
## $ isothiocyanates : num 214 169 158 136 150 ...
## $ trichomes : num 22.8 37.8 37.1 29.6 30.9 37.4 32.2 36.1 34.5 43.2 ...
## $ LAR : num 31.5 18.5 20.9 17.7 18.7 ...
A. Given the data above of average leaf area removed (mm^2) by caterpillars (LAR), what is the best predictive model of leaf area removed based on the available data?
First we fit the model with all predictors:
LAR_i=β_0+β_1 Leaf.glucosinolate_i+β_2 Leaf.thickness_i+β_3 Isothiocyanates_i+β_4 Trichomes_i+ε_i
# A. Given the data above of average leaf area removed ($mm^2$) by caterpillars (LAR),
# what is the best predictive model of leaf area removed based on the available data?
## fitting model with all variables
lm_all = lm(LAR~., data = mustard.df)
## summary from this model
summary(lm_all)
##
## Call:
## lm(formula = LAR ~ ., data = mustard.df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.2837 -2.7695 -0.3424 2.5524 15.1614
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.88923 3.87901 -0.487 0.627
## leaf.glucosinolate 0.01985 0.02529 0.785 0.435
## leaf.thickness 3.56650 1.68620 2.115 0.037 *
## isothiocyanates 0.10810 0.01134 9.535 1.63e-15 ***
## trichomes 0.02494 0.05835 0.428 0.670
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.5 on 95 degrees of freedom
## Multiple R-squared: 0.6217, Adjusted R-squared: 0.6057
## F-statistic: 39.03 on 4 and 95 DF, p-value: < 2.2e-16
The summary shows that some variables are not significant. So the full model cannot be the best. We perform stepwise backward regression, where each step we remove one variable and fit the new model. We compare then two models using F test, the variable is excluded if the result of the F test confirms it. The procedure continues until no variable could be excluded using the Anova test.
We are call the built-in stepwise regression in R.
## stepwise regression
lm_best= step(lm_all, trace = -1)
## Start: AIC=305.71
## LAR ~ leaf.glucosinolate + leaf.thickness + isothiocyanates +
## trichomes
##
## Df Sum of Sq RSS AIC
## - trichomes 1 3.70 1927.8 303.90
## - leaf.glucosinolate 1 12.47 1936.6 304.35
##
## - leaf.thickness 1 90.61 2014.7 308.31
## - isothiocyanates 1 1841.41 3765.5 370.85
##
## Step: AIC=303.9
## LAR ~ leaf.glucosinolate + leaf.thickness + isothiocyanates
##
## Df Sum of Sq RSS AIC
## - leaf.glucosinolate 1 14.79 1942.6 302.66
##
## - leaf.thickness 1 91.93 2019.8 306.56
## - isothiocyanates 1 1845.92 3773.8 369.07
##
## Step: AIC=302.66
## LAR ~ leaf.thickness + isothiocyanates
##
## Df Sum of Sq RSS AIC
##
## - leaf.thickness 1 81.92 2024.5 304.79
## - isothiocyanates 1 1851.00 3793.6 367.59
## summary from this model
summary(lm_best)
##
## Call:
## lm(formula = LAR ~ leaf.thickness + isothiocyanates, data = mustard.df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -10.0300 -2.5728 -0.4106 2.4101 15.1797
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.41168 1.74043 0.811 0.4193
## leaf.thickness 3.33593 1.64941 2.022 0.0459 *
## isothiocyanates 0.10834 0.01127 9.614 9.15e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.475 on 97 degrees of freedom
## Multiple R-squared: 0.618, Adjusted R-squared: 0.6102
## F-statistic: 78.47 on 2 and 97 DF, p-value: < 2.2e-16
So, the step wise regression chooses the model with the variables “Leaf.thickness” and “Isothiocyanates”, both are significant.
B. Analyze your model using analysis of variance.
# B. Analyze your model using analysis of variance.
## anova of the best model
anova(lm_best)
## Analysis of Variance Table
##
## Response: LAR
## Df Sum Sq Mean Sq F value Pr(>F)
## leaf.thickness 1 1292.2 1292.22 64.524 2.287e-12 ***
## isothiocyanates 1 1851.0 1851.00 92.425 9.152e-16 ***
## Residuals 97 1942.6 20.03
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
F test based on ANOVA for the both selected predictors is highly significant.
We also obtain coefficient of determination, R^2, for the whole model:
## coefficient of determination:
summary(lm_best)$adj.r.squared
## [1] 0.610157
Which implies that over 61% of variation in LAR could be explained by the model, which is relatively good indicator.
C. Based on your model, what is the most important predictor for leaf area removed (and on what do you base this)?
We look at Pearson correlation coefficient for the two chosen predictors and the response variable “LAR”:
# C. Based on your model, what is the most important predictor
# for leaf area removed (and on what do you base this)?
# correlation
round(cor(mustard.df[,c("LAR","leaf.thickness","isothiocyanates")]),3)
## LAR leaf.thickness isothiocyanates
## LAR 1.000 0.504 0.776
## leaf.thickness 0.504 1.000 0.509
## isothiocyanates 0.776 0.509 1.000
We note that variable “isothiocyanates” has higher correlation with the response variable “LAR”. Secondly, looking at the summary of the model (p value and test statistic), here again the variable “isothiocyanates” is highly significant.
We conclude that the variable “isothiocyanates” is the most important predictor.
D. What would your model predict for leaf area removed (LAR) for a plant with 111.2 ug/mgdwglucosinolate, 149.2 ug/mgdw isothiocyanates, 0.6 mm leaf thickness, and 39.1 trichomes per sq mm?
Since the best model includes only 2 variables, we include use only these two variables for prediction.
# D. What would your model predict for leaf area removed (LAR)
# for a plant with 111.2 ug/mgdwglucosinolate, 149.2 ug/mgdw isothiocyanates,
# 0.6 mm leaf thickness, and 39.1 trichomes per sq mm?
newdata = data.frame(isothiocyanates = 149.2, leaf.thickness = 0.6)
pred.lm = predict(lm_best, newdata = newdata)
cat("\n Predicted LAR for the given plant is ", pred.lm)
##
## Predicted LAR for the given plant is 19.57753