Quality Control
As the final step, we investigate the impact of temperature, density, am, and rate on the number of defects, using multiple linear regression models. The objective is to see whether you can recommend to your manager, Ole, a way to decrease the number of defects in production.
Use the quality data to answer the questions in the QualityControl_Part2_RScript.
What to submit: Your final submission must include (1) the .Rmd file and (2) the knitted HTML of your script.
Quality Control Solution
---
title: "Quality_RScript_Answers"
author: "YOUR NAME HERE"
date: "ADD THE DATE"
output:
  html_document: default
editor_options:
  chunk_output_type: console
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
## R Markdown
This is an R Markdown document. Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents. For more details on using R Markdown see <http://rmarkdown.rstudio.com>.
When you click the **Knit** button a document will be generated that includes both content as well as the output of any embedded R code chunks within the document. You can embed an R code chunk like this:
Use Ctrl+Enter to run a code chunk on a PC.
Use Cmd+Enter to run a code chunk on a Mac.
## Load Packages
In this section, we install and load the necessary packages.
```{r libraries, message=FALSE, include = FALSE}
### Install packages. If you haven't installed the following packages, uncomment the lines below to install them. Then, comment them out again before knitting the document.
#install.packages("dplyr")
#install.packages("ggplot2")
### load libraries for use in current working session
library('dplyr')
library('ggplot2')
```
## Import Data
In this section, we import the necessary data for this lab.
```{r import, include=FALSE}
### set your working directory
# use setwd to set your working directory
# you can also go to session-> set working directory -> choose directory
# working directory is the path to the folder and not file
# make sure the path of the directory is correct, i.e., where you have stored your data
setwd("C:/Users/sansar13/Documents/mgt585/data")
### import data file
# read the files using read.csv
quality <- read.csv(file = "quality.csv")
```
# Quality Control Case
Everybody seems to disagree about just why so many parts have to be fixed or thrown away after they are produced. Some say that it’s the temperature of the production process, which needs to be held constant (within a reasonable range). Others claim that it’s clearly the density of the product, and that if we could only produce a heavier material, the problems would disappear. Then there is Ole the site manager, who has been warning everyone forever to take care not to push the equipment beyond its limits. This problem would be the easiest to fix, simply by slowing down the production rate; however, this would increase costs. Unfortunately, rate is the only variable that the manager can control. Interestingly, many of the workers on the morning shift think that the problem is “those inexperienced workers in the afternoon,” who, curiously, feel the same way about the morning workers.
Ever since the factory was automated, with computer network communication and bar code readers at each station, data have been piling up. After taking the MGT585 class, you’ve finally decided to have a look. Your assistant aggregated the data by 4-hour blocks and then typed in the AM/PM variable. You found the following description of the variables:
*temp*: measures the temperature variability as a standard deviation during the time of measurement
*density*: indicates the density of the final product
*rate*: rate of production
*am*: 1 indicates morning and 0 afternoon
*defect*: average number of defects per 1000 produced
Do the following tasks and answer the questions below.
## Task 1: Data Transformation, Descriptive Stats and Visualization of 'am' Variable
*am* is a categorical variable: 1 = morning and 0 = afternoon.
1. Convert *am* to factor
2. Calculate the frequency distribution for *am*
3. Plot the relationship between *am* and *defect* using a bar chart
```{r amTask}
# Correct the type of am
# descriptive stats for am: frequency distribution
# use bar chart to plot relationship between am and defect
```
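One way to fill in this chunk is sketched below. It assumes the `quality` data frame from the import chunk with the columns described above; plotting the mean of *defect* per shift is one reasonable reading of the bar-chart instruction.
```{r amTaskSketch}
# Convert am to a factor so R treats it as a categorical variable
quality$am <- as.factor(quality$am)
# Frequency distribution of am
table(quality$am)
# Bar chart of the average number of defects by shift
ggplot(quality, aes(x = am, y = defect)) +
  stat_summary(fun = mean, geom = "bar") +
  labs(x = "Shift (1 = morning, 0 = afternoon)", y = "Average defects per 1,000")
```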
**Question 1**: What do the frequency distribution and the bar chart show?
## Task 2: Simple Regression Model with a Qualitative Predictor
Use lm() to run a regression analysis with am as X and defect as Y.
```{r qualityam}
# Use lm() to run a regression analysis on am as X and defect as Y
```
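A minimal sketch (`fit_am` is an illustrative name; it assumes *am* was converted to a factor in Task 1):
```{r qualityamSketch}
# Simple regression of defect on the qualitative predictor am
fit_am <- lm(defect ~ am, data = quality)
summary(fit_am)
```
Because *am* is a factor with afternoon (0) as the baseline level, the coefficient on the morning dummy is the difference in average defects between the morning and afternoon shifts.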
**Question 2**: How do you interpret the results? Interpret (1) the coefficient estimates, (2) the p-value for beta1, (3) R-squared, and (4) the p-value for the F-statistic.
**Question 3**: Interestingly, many of the workers on the morning shift think that the problem is those inexperienced workers in the afternoon, who, curiously, feel the same way about the morning workers. Based on your regression analysis, do you think the morning-shift workers' claim is true?
## Task 3: Multiple Regression Model
Run a full multiple linear regression model with temp, density, rate, am, and the interaction between rate and am as predictors.
```{r qualityFull}
# Use lm() to run a multiple regression analysis
```
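A sketch of the full model (in R's formula syntax, `rate:am` adds the interaction term; `rate*am` would expand to the same main effects plus the interaction):
```{r qualityFullSketch}
# Full model: all four predictors plus the rate-by-am interaction
fit_full <- lm(defect ~ temp + density + rate + am + rate:am, data = quality)
summary(fit_full)
```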
**Question 4**: How do you interpret the results?
**Question 5**: What is your final recommendation to your manager Ole?
Stock Market
This case is similar in nature to the Smarket data from the R lab.
We analyze the percentage returns for the S&P 500 stock index: 1,089 weekly returns spanning 21 years, from the beginning of 1990 until the end of 2010. The objective is to predict the market movement (Up or Down) using historical data.
Use the Weekly data to answer the questions in the Stock Market_RScript.
Stock Market Solution
---
title: "Stock Market_RScript_Answers"
author: "YOUR NAME HERE"
date: "ADD THE DATE"
output:
  html_document: default
editor_options:
  chunk_output_type: console
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
## R Markdown
This is an R Markdown document. Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents. For more details on using R Markdown see <http://rmarkdown.rstudio.com>.
When you click the **Knit** button a document will be generated that includes both content as well as the output of any embedded R code chunks within the document. You can embed an R code chunk like this:
Use Ctrl+Enter to run a code chunk on a PC.
Use Cmd+Enter to run a code chunk on a Mac.
## Load Packages
In this section, we install and load the necessary packages.
```{r libraries, message=FALSE, include = FALSE}
### Install packages. If you haven't installed the following packages, uncomment the lines below to install them. Then, comment them out again before knitting the document.
#install.packages("ggplot2")
#install.packages("class")
#install.packages("ROSE")
### load libraries for use in current working session
library('ggplot2')
library('class') # to run KNN
library('ROSE') # to generate ROC
```
## Import Data
In this section, we import the necessary data for this lab.
```{r import, include=FALSE}
### set your working directory
# use setwd to set your working directory
# you can also go to session-> set working directory -> choose directory
# working directory is the path to the folder and not file
# make sure the path of the directory is correct, i.e., where you have stored your data
setwd("C:/Users/sansar13/Documents/mgt585/data")
### import data file
# read the files using read.csv
Weekly <- read.csv(file = "weekly.csv")
```
# Stock Market Case
We use the *Weekly.csv* data set, which is similar in nature to the Smarket data from the R lab.
This data set consists of percentage returns for the S&P 500 stock index: 1,089 weekly returns spanning 21 years, from the beginning of 1990 until the end of 2010. For each week, we have recorded the percentage returns for each of the five previous trading weeks, Lag1 through Lag5. We have also recorded Volume (the number of shares traded on the previous week, in billions), Today (the percentage return for this week), and Direction (whether the market was Up or Down on this week).
Do the following tasks and answer the questions below.
## Task 1: Data exploration
Produce some numerical and graphical summaries of the Weekly data.
```{r Weeklyexplore}
# Explore the dataset using 5 functions: dim(), str(), colnames(), head(), and tail()
# use summary() to print the descriptive statistics
# Correct the type of 'Direction', which has to be a factor
# use pairs() to produce a scatterplot matrix of all pairwise relationships among the numeric variables
# use cor() to create the correlation matrix of all numerical variables
```
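A possible sketch, assuming the `Weekly` data frame from the import chunk (pairs() and cor() only accept numeric columns, so Direction is excluded):
```{r WeeklyexploreSketch}
# Structure and size of the data
dim(Weekly)
str(Weekly)
colnames(Weekly)
head(Weekly)
tail(Weekly)
# Descriptive statistics
summary(Weekly)
# Direction should be a factor, not a character vector
Weekly$Direction <- as.factor(Weekly$Direction)
# Scatterplot matrix and correlation matrix of the numeric variables
numeric_cols <- sapply(Weekly, is.numeric)
pairs(Weekly[, numeric_cols])
cor(Weekly[, numeric_cols])
```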
**Question 1** : Do there appear to be any patterns?
## Task 2: Logistic Regression
Use the full data set to perform a logistic regression with Direction as the response and the five lag variables plus Volume as predictors. Use the summary function to print the results.
```{r WeeklyLogistic}
# Use glm() to run a logistic analysis on Lag1 through Lag5 and Volume as predictors and Direction as the response
```
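A minimal sketch (`fit_logit` is an illustrative name; `family = binomial` is what makes glm() fit a logistic regression):
```{r WeeklyLogisticSketch}
# Logistic regression of Direction on the five lags and Volume
fit_logit <- glm(Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + Volume,
                 data = Weekly, family = binomial)
summary(fit_logit)
```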
**Question 2**: Do any of the predictors appear to be statistically significant? If so, which ones?
## Task 3: Confusion Matrix
Compute the confusion matrix and overall fraction of correct predictions.
```{r WeeklyConfusion}
# predict the Direction probability of the whole dataset using the fitted logistic regression
# create a vector of class predictions based on whether the predicted probability of a market increase is greater than or less than 0.5
# Use table() function to produce a confusion matrix
```
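A sketch that builds on the `fit_logit` model above. With two factor levels, predict() returns the probability of the second level, Up:
```{r WeeklyConfusionSketch}
# Predicted probabilities of Up for the whole data set
probs <- predict(fit_logit, type = "response")
# Classify as Up when the predicted probability exceeds 0.5
pred <- ifelse(probs > 0.5, "Up", "Down")
# Confusion matrix: predictions in rows, actual values in columns
conf <- table(Predicted = pred, Actual = Weekly$Direction)
conf
```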
Use the confusion matrix to compute Accuracy, Sensitivity and Specificity.
```{r}
# Accuracy
# Sensitivity
# Specificity
```
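A sketch using the `conf` matrix above, treating Up as the positive class:
```{r WeeklyMetricsSketch}
# Accuracy: overall fraction of correct predictions
accuracy <- (conf["Up", "Up"] + conf["Down", "Down"]) / sum(conf)
# Sensitivity: fraction of actual Up weeks predicted as Up (true positive rate)
sensitivity <- conf["Up", "Up"] / sum(conf[, "Up"])
# Specificity: fraction of actual Down weeks predicted as Down (true negative rate)
specificity <- conf["Down", "Down"] / sum(conf[, "Down"])
accuracy; sensitivity; specificity
```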
**Question 3**: Explain what the confusion matrix is telling you about the types of errors made by logistic regression. In other words, interpret the Accuracy, Sensitivity and Specificity.
## Task 4: Training and Testing Sets
Now fit the logistic regression model using a training data period from 1990 to 2008, with Lag2 as the only predictor. Compute the confusion matrix and the overall fraction of correct predictions (accuracy) for the held out data (that is, the data from 2009 and 2010).
```{r traintestWeekly}
# set seed to 1
set.seed(1)
## split the data into training and testing sets based on the year.
# Use the data before 2009 as the training set and use the data of years 2009 and 2010 as the testing set
# Use glm() to run a logistic analysis on Lag2 as predictor and Direction as the response
# predict the Direction probability of the testing set using the fitted logistic regression
# create a vector of class predictions based on whether the predicted probability of a market increase is greater than or less than 0.5
# Use table() function to produce a confusion matrix
# Calculate accuracy
```
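A possible sketch, assuming the data include a `Year` column; the names `trainWeekly` and `testWeekly` match the hint in Task 5:
```{r traintestSketch}
# Training set: 1990-2008; testing set: 2009-2010
train <- Weekly$Year < 2009
trainWeekly <- Weekly[train, ]
testWeekly <- Weekly[!train, ]
# Logistic regression with Lag2 as the only predictor, fit on the training set
fit_lag2 <- glm(Direction ~ Lag2, data = trainWeekly, family = binomial)
# Predicted probabilities and class predictions for the held-out weeks
probs_test <- predict(fit_lag2, newdata = testWeekly, type = "response")
pred_test <- ifelse(probs_test > 0.5, "Up", "Down")
table(Predicted = pred_test, Actual = testWeekly$Direction)
# Accuracy on the testing set
mean(pred_test == testWeekly$Direction)
```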
**Question 4**: Is this classifier better than the logistic model fitted in Task 2? Explain.
## Task 5: KNN
Repeat Task 4 using KNN with K = 1 and K = 10. Note that you should only use Lag2 as the predictor.
```{r Weeklyk1}
### KNN for k=1
## IMPORTANT: you must use the as.matrix() function to convert to a matrix
# This is a requirement imposed by knn() function
# So, you should write knn(as.matrix(trainWeekly[,'Lag2']), as.matrix(testWeekly[,'Lag2']), trainWeekly$Direction, k = 1)
# Use table() function to produce a confusion matrix
# Calculate accuracy
### KNN k = 10
# Use table() function to produce a confusion matrix
# Calculate accuracy
```
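A sketch that reuses `trainWeekly` and `testWeekly` from Task 4 (knn() comes from the class package loaded above; it breaks ties at random, which is why the seed was set):
```{r WeeklyknnSketch}
# KNN with k = 1
knn_pred1 <- knn(as.matrix(trainWeekly[,'Lag2']), as.matrix(testWeekly[,'Lag2']),
                 trainWeekly$Direction, k = 1)
table(Predicted = knn_pred1, Actual = testWeekly$Direction)
mean(knn_pred1 == testWeekly$Direction)  # accuracy
# KNN with k = 10
knn_pred10 <- knn(as.matrix(trainWeekly[,'Lag2']), as.matrix(testWeekly[,'Lag2']),
                  trainWeekly$Direction, k = 10)
table(Predicted = knn_pred10, Actual = testWeekly$Direction)
mean(knn_pred10 == testWeekly$Direction)  # accuracy
```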
## Task 6: ROC Curves and AUC
Plot the ROC curves and compute the AUC for the Task 4 logistic regression, KNN (k = 1), and KNN (k = 10).
```{r roc}
# ROC curve for logistic regression
# ROC curve for KNN k = 1
# ROC curve for KNN k = 10
```
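A sketch using roc.curve() from the ROSE package loaded above. ROC curves need predicted probabilities, so knn() is rerun with `prob = TRUE`; its `prob` attribute holds the vote share of the *winning* class, which must be converted into the probability of Up:
```{r rocSketch}
# ROC and AUC for the Task 4 logistic regression on the testing set
roc.curve(testWeekly$Direction, probs_test, col = "black")
# KNN k = 1: vote proportions converted to P(Up)
knn1 <- knn(as.matrix(trainWeekly[,'Lag2']), as.matrix(testWeekly[,'Lag2']),
            trainWeekly$Direction, k = 1, prob = TRUE)
p_up1 <- ifelse(knn1 == "Up", attr(knn1, "prob"), 1 - attr(knn1, "prob"))
roc.curve(testWeekly$Direction, p_up1, add.roc = TRUE, col = "blue")
# KNN k = 10
knn10 <- knn(as.matrix(trainWeekly[,'Lag2']), as.matrix(testWeekly[,'Lag2']),
             trainWeekly$Direction, k = 10, prob = TRUE)
p_up10 <- ifelse(knn10 == "Up", attr(knn10, "prob"), 1 - attr(knn10, "prob"))
roc.curve(testWeekly$Direction, p_up10, add.roc = TRUE, col = "red")
```
Note that with k = 1 the predicted probabilities are all 0 or 1, so its ROC curve has only a single interior point; this is expected rather than a bug.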
**Question 5**: Which of these methods appears to provide the best results on this data? Use the accuracy, AUC, and ROC curve results.