
Computation of Simple Linear Regression and Drawing Meaningful Conclusions

November 19, 2022
Dr. Eamon Hale
Dr. Eamon Hale, a Statistics Homework Expert, earned his Ph.D. from Johns Hopkins University, one of the top universities in the USA. With over 12 years of experience, he excels in providing insightful statistical analysis and data-driven solutions for students.
Key Topics
  • Assignment on Simple Linear Regression

Assignment on Simple Linear Regression

EMBA Final Exam B01.1305

12 May 2021

  • Please write your name on every answer book that you use. Make sure that you number your solutions correctly.
  • Read all questions carefully.
  • Show your work so that partial credit can be given. Poorly described solutions will be penalized.
  • Not all questions are of the same level of difficulty.
  • For all multiple-choice questions, one point for the right choice, the remaining points for justification.
  • There are 4 questions on this exam. You must complete all 4 questions correctly to get full points (i.e., 50 points) on this exam. Good luck!

Name: ______________________________________________________________

  1. [16 points] Answer the following statistics assignment questions. Justify your answers briefly. No credit will be given if you merely provide a choice without some justification for it.
    1. [4 points] Your colleague in a financial institution says that she has been tracking the monthly returns of Facebook and Amazon stock. Using data on these returns over the last 10 years, she says that she has computed the COVARIANCE between these two return series and found that it is 0.00042. Since this COVARIANCE is so low and close to zero, she says that there does not seem to be any association between the two return series.

      You tell her that (choose one of the following)

      1. her reasoning is faulty because….(give a brief reason)
      2. her reasoning is correct because…(give a brief reason)
    2. Answer:

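      Her reasoning is faulty. Covariance is unbounded and depends on the units of the two variables, so its magnitude alone says very little about the strength of an association. Monthly returns are small decimals with small variances, so a covariance of 0.00042 can still correspond to a strong relationship. She should instead compute the correlation, which divides the covariance by the product of the two standard deviations and therefore always lies between -1 and 1, and then test it for significance before concluding anything.

      correlation(X, Y) = covariance(X, Y) / (SD(X) × SD(Y))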
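The scale dependence of covariance can be illustrated with a minimal sketch. The return values below are made up for illustration and are not the Facebook or Amazon data; the helper functions `covariance` and `correlation` are written here for the example.

```python
from math import sqrt

def covariance(x, y):
    """Sample covariance with the n - 1 divisor."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / (n - 1)

def correlation(x, y):
    """Pearson correlation: covariance divided by the product of the SDs."""
    return covariance(x, y) / sqrt(covariance(x, x) * covariance(y, y))

# Hypothetical monthly returns expressed as decimals (illustrative only).
stock_a = [0.02, -0.01, 0.03, 0.00, -0.02, 0.04]
stock_b = [0.03, -0.02, 0.04, 0.01, -0.03, 0.05]

cov_decimal = covariance(stock_a, stock_b)
# The same returns expressed in percent: the covariance grows by 100 * 100.
cov_percent = covariance([100 * r for r in stock_a], [100 * r for r in stock_b])
corr = correlation(stock_a, stock_b)

print(f"covariance (decimal returns): {cov_decimal:.6f}")
print(f"covariance (percent returns): {cov_percent:.2f}")
print(f"correlation (unit-free):      {corr:.3f}")
```

In this toy data the covariance of the decimal returns is well below 0.001, yet the correlation is close to 1, and simply switching units to percent multiplies the covariance by 10,000 without changing the correlation.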

    3. [4 points] Is it possible that when you fit a simple regression model, the t-statistic for the slope coefficient is large (outside the range of (-2,2)), indicating that the X variable has a linear relationship with the Y variable, but that the R-squared value is quite low, say 8%?
      1. Yes (justify your choice with a short explanation)
      2. No (justify your choice with a short explanation)
    4. Answer:

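      Yes. The t-statistic measures whether the slope is distinguishable from zero, while R-squared measures the share of the variance in Y that the model explains. If the noise in the data is high, R-squared will be low, yet with a large enough sample the slope can still be estimated precisely enough to give a t-statistic well outside (-2, 2).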
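A quick way to see this is the identity that holds in simple linear regression, t² = (n - 2) · R² / (1 - R²). The sketch below evaluates it for R-squared = 8% at a few sample sizes; the sample sizes are arbitrary choices for illustration, and the function name is hypothetical.

```python
from math import sqrt

def slope_t_from_r2(r_squared, n):
    """|t| for the slope in simple regression, from R-squared and sample size n."""
    return sqrt((n - 2) * r_squared / (1 - r_squared))

# Same R-squared of 8%, increasing sample sizes.
for n in (30, 100, 500):
    print(f"n = {n:3d}: |t| = {slope_t_from_r2(0.08, n):.2f}")
```

With 30 observations the slope would not be significant, but with 100 or more the same 8% R-squared goes with a t-statistic outside (-2, 2).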

    5. [4 points] Your colleague is running a simple regression of Y on X. He makes a plot of the standardized residuals vs. the fitted values shown below and you observe that there is a funnel shape and so very clear evidence that there is non-constant variance in the data.
      [Figure: standardized residuals versus fitted values, showing a funnel shape]

      However, your colleague insists on going ahead and fitting the regression model without replacing the Y values with log(Y). Briefly yet clearly, describe the two errors that his resulting analysis, based on the untransformed Y variable, is likely to make.

      Answer:

      The errors his analysis is likely to make are:

      First, the coefficient estimates are less precise. Heteroscedasticity does not bias the coefficient estimates, but it does make them less efficient, and lower precision increases the likelihood that the estimates are further from the correct population values.

      Second, the inference is unreliable. Heteroscedasticity tends to produce p-values that are smaller than they should be, because it increases the variance of the coefficient estimates but the OLS procedure does not detect this increase. The reported standard errors, t-tests, and confidence or prediction intervals therefore cannot be trusted.

    6. [4 points] The regression of log(revenue of a firm) on log(R&D expenditure of firm) yields the following equation:

      Log(Revenue) = 1.3 + 0.65 Log(R&D Expenditure)

      In one sentence, interpret the value 0.65 of the slope in terms of the original variables “revenue of a firm” and “R&D expenditure of firm” (i.e. in terms of the unlogged variables).

      Answer:

      Because both variables are logged, the slope is an elasticity: a 1% increase in R&D expenditure is associated with an increase of about 0.65% in revenue, on average. Equivalently, assuming natural logs (i.e., base e), multiplying R&D expenditure by e ≈ 2.72 multiplies the predicted revenue by e^0.65 ≈ 1.92.
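In a log-log fit the slope is usually read as an elasticity, i.e. the percentage change in revenue that goes with a 1% change in R&D expenditure. The sketch below checks this numerically by back-transforming the fitted equation. Natural logs are assumed, and the R&D level of 100 is a hypothetical value chosen only for illustration.

```python
from math import exp, log

INTERCEPT, SLOPE = 1.3, 0.65  # from Log(Revenue) = 1.3 + 0.65 Log(R&D Expenditure)

def predicted_revenue(rd):
    """Back-transform the fitted log-log equation (natural logs assumed)."""
    return exp(INTERCEPT + SLOPE * log(rd))

base = predicted_revenue(100.0)    # hypothetical R&D level
bumped = predicted_revenue(101.0)  # the same firm with 1% more R&D
pct_change = 100 * (bumped / base - 1)

print(f"1% more R&D -> {pct_change:.3f}% more predicted revenue")
```

The percentage change comes out very close to 0.65 regardless of the starting R&D level, because the ratio of the two predictions depends only on the ratio of the R&D values.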

  2. [14 points] The marketing manager of a large supermarket chain would like to determine the effect of shelf space and whether the product was placed at the front or back of the aisle on the sales of pet food. A random sample of 12 equal-sized stores was taken and the following variables were noted:

    Y= sales=daily sales of pet food (in thousands of $)

    space = shelf space for the pet food, in square feet

    location=0 if the pet food was placed at the back of the aisle

    location= 1 if the pet food was placed at the front of the aisle

    The output from the fitted multiple regression is shown below

    Model Summary

    S         R-sq    R-sq(adj)  R-sq(pred)
    0.213177  86.38%  83.35%     77.88%

    Coefficients

    Term      Coef    SE Coef  T-Value  P-Value  VIF
    Constant  1.300   0.157    8.29     0.000
    space     0.0740  0.0110   6.72     0.000    1.00
    location  0.450   0.131    3.45     0.007    1.00

    Regression Equation

    sales = 1.300 + 0.0740 space + 0.450 location

    1. [3 points] The manager believes that for a fixed amount of shelf space, products placed at the front of the aisle sell more on average than products placed at the back. Is there evidence to support his belief? (Justify your answer with an appropriate number)

      Answer:

      Yes. The coefficient of location is 0.450 with a t-value of 3.45 (p = 0.007), which is statistically significant. Because the coefficient is positive, for a fixed amount of shelf space a product at the front of the aisle is expected to sell about $450 more per day than the same product at the back.

    2. [1 point] Predict the daily sales of pet food if the product is placed at the front of the aisle and has 6 square feet of shelf space devoted to it.

      Answer:

      Predicted sales = 1.300 + 0.0740 *6 + 0.450 *1=2.194K

      The predicted sales are $2194.

    3. [5 points] For a store that places the pet food according to the plan in (ii) above (i.e. at the front of the aisle with 6 square feet of shelf space), what is the probability that the daily sales are less than $1550? (Justify your answer with an explanation)

      Answer:

      Predicted sales = 1.300 + 0.0740 *6 + 0.450 *1=2.194

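      Assuming that daily sales for such a store are normally distributed around the predicted value, sales have a mean of 2.194 and an SD equal to the regression standard error, S = 0.2132 (both in thousands of dollars).

      The probability that sales are less than $1550 is:

      P(sales < 1.550) = P(Z < (1.550 - 2.194) / 0.2132) = P(Z < -3.02) ≈ 0.0013

      The probability is very low (p ≈ 0.0013) that the daily sales are less than $1550.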
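The normal-probability calculation can be reproduced with Python's standard library. This is a sketch under the stated assumption that individual daily sales are normal around the fitted value with SD equal to S from the Model Summary; it ignores the additional uncertainty from estimating the coefficients.

```python
from statistics import NormalDist

# Fitted value for 6 square feet at the front of the aisle (thousands of $).
fitted = 1.300 + 0.0740 * 6 + 0.450 * 1
s = 0.213177  # S as printed in the Model Summary

# Assumption: daily sales ~ Normal(fitted, s).
threshold = 1.550  # $1550 expressed in thousands of dollars
z = (threshold - fitted) / s
prob = NormalDist(mu=fitted, sigma=s).cdf(threshold)

print(f"fitted = {fitted:.3f}, z = {z:.2f}, P(sales < 1.550) = {prob:.4f}")
```

The threshold lies about three standard deviations below the fitted value, so the probability is on the order of one in a thousand.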

    4. [5 points] An analyst in Ames, Iowa is provided exactly the same data for analysis and she fits the same multiple regression model as above. However, she codes her dummy variable for location as follows:

      X2 = location = 1 if the product was placed at the back of the aisle

      X2 = location = 0 if the product was placed at the front of the aisle

      She uses her model to predict daily sales of pet food if the product is placed at the front of the aisle and has 6 square feet of shelf space devoted to it. (i.e. the same characteristics as in part (ii) above)

      1. In what way would her predicted value differ from the value you obtained in (ii) above?

        Answer:

        The predicted value will not be different, because only the labeling of the two locations changes. The coefficients, however, will change: the intercept will now be equal to 1.300 + 0.450 = 1.750 and the coefficient of X2 will be -0.450.

      2. What estimate would she get for the coefficient of location in her fitted regression equation?

        Answer:

        The coefficients will be:

        Intercept = 1.750

        Coef X1 = 0.074

        CoefX2 = -0.450
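The equivalence of the two codings can be checked directly from the two sets of coefficients. The function names below are hypothetical and exist only for this sketch.

```python
def predict_original(space, front):
    """Original coding: location = 1 for the front of the aisle, 0 for the back."""
    return 1.300 + 0.0740 * space + 0.450 * front

def predict_recoded(space, back):
    """Recoded dummy: X2 = 1 for the back of the aisle, 0 for the front."""
    return 1.750 + 0.0740 * space - 0.450 * back

# Compare both models for 6 square feet at each location.
for front in (1, 0):
    back = 1 - front
    place = "front" if front else "back"
    print(place, round(predict_original(6, front), 3), round(predict_recoded(6, back), 3))
```

Both models give 2.194 for the front of the aisle and 1.744 for the back, so the recoding changes how the coefficients are read but not the predictions.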

  3. [10 points] A real estate company has collected data on the following variables for several houses in a suburb of NYC:

      Price: the price of the house (in $)

      Story: the number of stories the house has

      Baths: the number of baths the house has

      A multiple regression fit to the above variables gave the following:

      Regression Analysis: Price versus Story, Baths

      Model Summary

      S        R-sq    R-sq(adj)  R-sq(pred)
      53098.7  42.71%  41.49%     38.60%

      Coefficients

      Term      Coef    SE Coef  T-Value  P-Value
      Constant  -44623  21492    -2.08    0.041
      Story     63097   41786    1.51     0.131
      Baths     42669   30048    1.42

      Regression Equation

      Price = -44623 + 63097 Story + 42669 Baths

      1. [2 points] Which of the explanatory variables in the model are important on an individual basis, after accounting for the other variables?

        You must state a number (or numbers) based on which you give your answer

        Answer:

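        Neither variable is individually significant at the 5% level after accounting for the other. This is based on the t-tests of the coefficients: Story has t = 1.51 (p = 0.131) and Baths has t = 1.42, both within (-2, 2). Of the two, Story has the larger t-value and hence the smaller p-value, so the evidence for it is stronger, but neither reaches statistical significance.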

      2. [4 points] (Answer this question using the output on the earlier page as is, regardless of whatever you may have concluded in (a) above) The company has a house in the suburb that it wishes to sell. This house is 2 stories tall and has 1 bath. Based on the FULL MODEL on the previous page, make a suggestion for what price the agency should list the house at such that the agency is neither underselling the house nor overpricing it significantly. It is fine if your answer is a range of values. YOU MUST PROVIDE JUSTIFICATION IN A FEW BRIEF SENTENCES FOR HOW YOU CAME UP WITH YOUR VALUE (OR RANGE OF VALUES)

        Answer:

        Price = -44623 + 63097 Story + 42669 Baths

        Price = -44623+63097×2+42669×1=124240.

        The fitted value is $124,240, which is the suggested price.

        If a range of values is required, an approximate 95% prediction interval is the fitted value ± 1.96 × S, where S = 53098.7 is the standard error of the regression:

        Lower limit = 124,240 - 1.96 × 53098.7 = $20,166.55

        Upper limit = 124,240 + 1.96 × 53098.7 = $228,313.45

        The fitted value is suggested as the sale price because it is the expected price of a property with these characteristics. If that price is not agreed on, the prediction interval gives a range that captures the price of such a property with approximately 95% confidence. The range is wide because the model explains only 42.71% of the variation in price.
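The arithmetic for the fitted value and the approximate range can be sketched as follows. The interval uses the simple fitted value ± 1.96 × S approximation, which ignores the uncertainty in the estimated coefficients; the dictionary keys are names chosen for this example.

```python
# Coefficients and S copied from the regression output.
coef = {"const": -44623, "story": 63097, "baths": 42669}
s = 53098.7

stories, baths = 2, 1
fitted = coef["const"] + coef["story"] * stories + coef["baths"] * baths

# Approximate 95% prediction interval: fitted value +/- 1.96 * S.
margin = 1.96 * s
lower, upper = fitted - margin, fitted + margin

print(f"fitted = {fitted:,}; approximate 95% range = ({lower:,.2f}, {upper:,.2f})")
```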

      3. [4 points] When the analyst who carried out the analysis presents the model to the real estate agents at the company, one agent says, “I am quite puzzled by this. The variable “baths” has a t-statistic value within (-2, 2), but I would definitely expect the number of bathrooms a house has to be related to its price.”

        Give a brief but clear response to the agent that will clear up their confusion.

        Answer:

        The t-statistic for Baths does not say that bathrooms are unrelated to price. It tests whether Baths adds explanatory power after Story is already in the model. The positive coefficient indicates that price tends to increase with the number of bathrooms, but Baths does not explain a significant additional share of the variation in price once Story is accounted for. One likely reason is that the two predictors are related to each other (houses with more stories tend to have more bathrooms), which makes it hard to separate their individual effects; there may also be interaction effects with other variables. The analysis is also not proof of causation. Including more variables that explain house prices could capture this relationship better.

  4. [10 points] This question builds on the airport security problem in question 2 from HW 3. The paragraph below, describing the setup, is identical to that in the HW.

      In November 2001, just after the 9/11 attacks, the NYTimes published an article titled “A small dose of common sense would help Congress break the deadlock over airport security”. The article considered the different factors that could impact the quality of security screening at airports. One of the factors that it considered was the turnover rate (a measure of how quickly employees leave the job) of airport security personnel and its potential impact on how good the security screening was. The article mentioned a study that considered the turnover rate at 19 airports across the country and also the violations detected (per million passengers) at each of those airports; the article reported that the study found that a lower turnover rate (i.e. employees stay in their job for a longer period) was associated with a greater likelihood of detecting violations (i.e. a large number of violations detected per million passengers) and thus advocated for measures that would reduce the turnover rate in order to increase the quality of the security screening.

      The original article in the newspaper also had the data for these two variables across the 19 airports and you can find that data in the file AirportViol.

      Below is a scatter plot of the violations detected per million passengers (Y) versus the turnover rate (X), as well as the output from a simple regression model fit to the data

      [Figure: scatter plot of violations detected per million passengers (Y) versus turnover rate (X)]

      Regression Analysis: ViolDet versus TurnRate

      Model Summary

      S        R-sq    R-sq(adj)  R-sq(pred)
      7.50850  16.11%  11.18%     0.00%

      Coefficients

      Term      Coef     SE Coef  T-Value  P-Value  VIF
      Constant  21.87    3.03     7.22     0.000
      TurnRate  -0.0304  0.0168   -1.81    0.088    1.00

      Regression Equation

      ViolDet = 21.87 - 0.0304 TurnRate

      1. [2 points] Does the sign of the estimated slope coefficient support the argument that article made about the relationship between violations detected per million passengers and the turnover rate? Explain your answer clearly in a sentence or two

        Answer:

        Yes. The estimated slope is negative (-0.0304), which means that a lower turnover rate is associated with more violations detected per million passengers. This is the direction of the relationship that the article described.

      2. Based on the regression output, is there evidence that there is a linear relationship between these two variables?

        Answer:

        There is no evidence of a linear relationship at the 5% level of significance. The t-test for the slope has t = -1.81 with p = 0.088, which is higher than 0.05.

        The original NYTimes article (snapshot below; you do NOT have to read the article, I am just showing it for clarity) also provided the locations of each of the 19 airports for which the data had been collected.

        [Figure: snapshot of the original NYTimes article]

        Using this additional information on the location of each airport, I categorized the airports into one of two categories:

        Airport in a major East or West coast city

        Airport not in a major East or West coast city

        I then created a dummy variable for “location in a major coastal city” to incorporate this information into the model, with the coding as

        Coast= 1 if Airport in a major East or West coast city

        Coast=0 if Airport not in a major East or West coast city

        You can see the first few rows of the additional variable in the snapshot below:

        [Figure: snapshot of the first few rows of the data with the Coast variable]

        I then ran a multiple regression of the violations detected on the turnover rate AND the location variables and got the following output:

        Regression Analysis: ViolDet versus TurnRate, Coast

        Analysis of Variance

        Model Summary

        S        R-sq    R-sq(adj)  R-sq(pred)
        5.47433  58.03%  52.79%     44.51%

        Coefficients

        Term      Coef     SE Coef  T-Value  P-Value  VIF
        Constant  13.61    3.02     4.50     0.000
        TurnRate  -0.0096  0.0133   -0.72    0.483    1.18
        Coast     10.92    2.73     4.00     0.001    1.18

        Regression Equation

        ViolDet = 13.61 - 0.0096 TurnRate + 10.92 Coast

      3. [2 points] Is there evidence of a relationship between violations and turnover rate in this multiple regression model? Provide brief justification

        Answer:

        No. The t-test for TurnRate has t = -0.72 with p = 0.483, so there is no evidence of a significant linear relationship between TurnRate and ViolDet once Coast is in the model.

      4. [2 points] Is there evidence of a relationship between violations and the location variable in this multiple regression model? Provide brief justification

        Answer:

        Yes. The t-test for Coast has t = 4.00 with p = 0.001, which is significant at all reasonable levels of significance and hence supports a relationship between violations and the location variable in this multiple regression model.
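The t-values quoted in these answers are simply each coefficient divided by its standard error. The sketch below recomputes them from the two-predictor output and applies the rough (-2, 2) rule of thumb used throughout this exam.

```python
# (coefficient, standard error) pairs copied from the ViolDet ~ TurnRate + Coast output.
terms = {
    "Constant": (13.61, 3.02),
    "TurnRate": (-0.0096, 0.0133),
    "Coast": (10.92, 2.73),
}

# t-value = coefficient / standard error.
t_values = {name: coef / se for name, (coef, se) in terms.items()}

for name, t in t_values.items():
    verdict = "outside (-2, 2)" if abs(t) > 2 else "inside (-2, 2)"
    print(f"{name:9s} t = {t:6.2f}  {verdict}")
```

Only TurnRate falls inside (-2, 2), which matches the conclusion that turnover rate is not significant once location is in the model while Coast is.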

      5. [4 points] What do you now think about the conclusion of the policy prescription of the article, viz., advocating for measures that would reduce the turnover rate in order to increase the quality of the security screening? What is most likely driving the relationship between violations and location, as found in (iv)?

        Give some justification for your answer

        Answer:

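        The policy prescription of the article, viz., advocating for measures that would reduce the turnover rate in order to increase the quality of the security screening, was not based on a rigorous analysis of the data. Once the location of the airport is accounted for, the turnover rate is no longer related to the violations detected (t = -0.72, p = 0.483), so the apparent relationship is mainly due to the location of the airports.

        The relationship with location may be driven by the fact that most major cities are located on the coasts and a large share of travelers pass through their airports, so the number of violations detected there is expected to be high. Note that ViolDet is already expressed per million passengers, so passenger volume alone would not explain the difference; location is more plausibly standing in for other characteristics of these airports that the data do not record. Either way, the simple comparison is biased toward these airports, and reducing the turnover rate cannot be expected to raise detection on its own.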