This comprehensive homework with solution uses R functionalities to predict the leaf area removed (LAR) by caterpillars in a species of wild mustard. We explore a range of variables, from leaf glucosinolate and isothiocyanate levels to leaf thickness and trichome count, seeking to build the most accurate predictive model. Follow our journey as we step through data preparation, model refinement, and statistical analysis to identify the key predictors of LAR in this fascinating plant species.
Problem Description:
In this R homework, we are tasked with analyzing data on a specific species of wild mustard. The data includes several variables related to the plant, such as leaf glucosinolate levels, isothiocyanate levels, leaf thickness, the number of trichomes, and the average leaf area removed by herbivores (LAR).
The goal of the homework is to determine the best predictive model for LAR based on the available data. To do this, we will follow a step-by-step approach involving data analysis and regression modeling.
Solution
Data.
We are given the data on a species of wild mustard with the following variables;
- Leaf glucosinolate (ug/mg dry weight) are chemicals that have been implicated in herbivore resistance
- Isothiocyanates (ug/mg dry weight) are chemicals that have been implicated in herbivore resistance.
- Leaf thickness (mm) is the average thickness of leaves based on the average of 10 randomly selected leaves.
- Trichomes (average number per square cm) are small leaf hairs.
- LAR is a measure of leaf area removed by an herbivore.
First we read data into R workspace.
## combining data into one data.frame
mustard.df = data.frame(leaf.glucosinolate = leaf.glucosinolate,
leaf.thickness = leaf.thickness,
isothiocyanates = isothiocyanates,
trichomes = trichomes,LAR = LAR)
##structure of data
str(mustard.df)
## 'data.frame': 100 obs. of 5 variables:
## $ leaf.glucosinolate: num 129 156 110 105 117 ...
## $ leaf.thickness : num 1.1 0.4 0.8 0.5 0.6 1.2 0.9 0.5 1 0.7 ...
## $ isothiocyanates : num 214 169 158 136 150 ...
## $ trichomes : num 22.8 37.8 37.1 29.6 30.9 37.4 32.2 36.1 34.5 43.2 ...
## $ LAR : num 31.5 18.5 20.9 17.7 18.7 ...
A. Given the data above of average leaf area removed (mm^2) by caterpillars (LAR), what is the best predictive model of leaf area removed based on the available data?
First we fit the model with all predictors:
LAR_i=β_0+β_1 Leaf.glucosinolate_i+β_2 Leaf.thickness_i+β_3 Isothiocyanates_i+β_4 Trichomes_i+ε_i
# A. Given the data above of average leaf area removed ($mm^2$) by caterpillars (LAR),
# what is the best predictive model of leaf area removed based on the available data?
## fitting model with all variables
lm_all = lm(LAR~., data = mustard.df)
## summary from this model
summary(lm_all)
##
## Call:
## lm(formula = LAR ~ ., data = mustard.df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.2837 -2.7695 -0.3424 2.5524 15.1614
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.88923 3.87901 -0.487 0.627
## leaf.glucosinolate 0.01985 0.02529 0.785 0.435
## leaf.thickness 3.56650 1.68620 2.115 0.037 *
## isothiocyanates 0.10810 0.01134 9.535 1.63e-15 ***
## trichomes 0.02494 0.05835 0.428 0.670
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.5 on 95 degrees of freedom
## Multiple R-squared: 0.6217, Adjusted R-squared: 0.6057
## F-statistic: 39.03 on 4 and 95 DF, p-value: < 2.2e-16
The summary shows that some variables are not significant. So the full model cannot be the best. We perform stepwise backward regression, where each step we remove one variable and fit the new model. We compare then two models using F test, the variable is excluded if the result of the F test confirms it. The procedure continues until no variable could be excluded using the Anova test.
We are call the built-in stepwise regression in R.
## stepwise regression
lm_best= step(lm_all, trace = -1)
## Start: AIC=305.71
## LAR ~ leaf.glucosinolate + leaf.thickness + isothiocyanates +
## trichomes
##
## Df Sum of Sq RSS AIC
## - trichomes 1 3.70 1927.8 303.90
## - leaf.glucosinolate 1 12.47 1936.6 304.35
##
## - leaf.thickness 1 90.61 2014.7 308.31
## - isothiocyanates 1 1841.41 3765.5 370.85
##
## Step: AIC=303.9
## LAR ~ leaf.glucosinolate + leaf.thickness + isothiocyanates
##
## Df Sum of Sq RSS AIC
## - leaf.glucosinolate 1 14.79 1942.6 302.66
##
## - leaf.thickness 1 91.93 2019.8 306.56
## - isothiocyanates 1 1845.92 3773.8 369.07
##
## Step: AIC=302.66
## LAR ~ leaf.thickness + isothiocyanates
##
## Df Sum of Sq RSS AIC
##
## - leaf.thickness 1 81.92 2024.5 304.79
## - isothiocyanates 1 1851.00 3793.6 367.59
## summary from this model
summary(lm_best)
##
## Call:
## lm(formula = LAR ~ leaf.thickness + isothiocyanates, data = mustard.df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -10.0300 -2.5728 -0.4106 2.4101 15.1797
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.41168 1.74043 0.811 0.4193
## leaf.thickness 3.33593 1.64941 2.022 0.0459 *
## isothiocyanates 0.10834 0.01127 9.614 9.15e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.475 on 97 degrees of freedom
## Multiple R-squared: 0.618, Adjusted R-squared: 0.6102
## F-statistic: 78.47 on 2 and 97 DF, p-value: < 2.2e-16
So, the step wise regression chooses the model with the variables “Leaf.thickness” and “Isothiocyanates”, both are significant.
B. Analyze your model using analysis of variance.
# B. Analyze your model using analysis of variance.
## anova of the best model
anova(lm_best)
## Analysis of Variance Table
##
## Response: LAR
## Df Sum Sq Mean Sq F value Pr(>F)
## leaf.thickness 1 1292.2 1292.22 64.524 2.287e-12 ***
## isothiocyanates 1 1851.0 1851.00 92.425 9.152e-16 ***
## Residuals 97 1942.6 20.03
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
F test based on ANOVA for the both selected predictors is highly significant.
We also obtain coefficient of determination, R^2, for the whole model:
## coefficient of determination:
summary(lm_best)$adj.r.squared
## [1] 0.610157
Which implies that over 61% of variation in LAR could be explained by the model, which is relatively good indicator.
C. Based on your model, what is the most important predictor for leaf area removed (and on what do you base this)?
We look at Pearson correlation coefficient for the two chosen predictors and the response variable “LAR”:
# C. Based on your model, what is the most important predictor
# for leaf area removed (and on what do you base this)?
# correlation
round(cor(mustard.df[,c("LAR","leaf.thickness","isothiocyanates")]),3)
## LAR leaf.thickness isothiocyanates
## LAR 1.000 0.504 0.776
## leaf.thickness 0.504 1.000 0.509
## isothiocyanates 0.776 0.509 1.000
We note that variable “isothiocyanates” has higher correlation with the response variable “LAR”. Secondly, looking at the summary of the model (p value and test statistic), here again the variable “isothiocyanates” is highly significant.
We conclude that the variable “isothiocyanates” is the most important predictor.
D. What would your model predict for leaf area removed (LAR) for a plant with 111.2 ug/mgdwglucosinolate, 149.2 ug/mgdw isothiocyanates, 0.6 mm leaf thickness, and 39.1 trichomes per sq mm?
Since the best model includes only 2 variables, we include use only these two variables for prediction.
# D. What would your model predict for leaf area removed (LAR)
# for a plant with 111.2 ug/mgdwglucosinolate, 149.2 ug/mgdw isothiocyanates,
# 0.6 mm leaf thickness, and 39.1 trichomes per sq mm?
newdata = data.frame(isothiocyanates = 149.2, leaf.thickness = 0.6)
pred.lm = predict(lm_best, newdata = newdata)
cat("\n Predicted LAR for the given plant is ", pred.lm)
##
## Predicted LAR for the given plant is 19.57753
Similar Samples
Explore our sample section to see how we solve various statistics problems using R programming. These examples demonstrate the approach we take to deliver accurate and efficient solutions, ensuring clarity and precision in every task. Check them out to understand our methods and work quality.
Statistical Analysis
R Programming
Time Series Analysis
Statistical Analysis
STATA
Econometrics
Statistical Tests
Data Analysis
Biostatistics
SPSS
Econometrics
Statistical Analysis
Statistics
Statistics
STATA
Statistics
Statistical Analysis
R Programming
Business Intelligence
Statistics