Case: Absenteeism
Hi, I’m Pam Poovey, the director of Human Resources for Ingels Sherman International Shipping. Recently I have been concerned about how often our workers have been taking days off for either personal or family health reasons (i.e., Family and Medical Leave). The company has a very generous leave policy, but I have noticed that many more people are using the leave policy than they used to. I suspect that at least some of the employees have been taking advantage of the company’s policy, but I would like to see if there is some sort of pattern in what kind of employee is more likely to take leave days. Our department has recorded whether or not each employee took a leave day last quarter, along with some basic information for each employee that we think may be related. We recorded their current salary, whether the employee is full time or part time, age, and whether or not the employee got a raise or promotion in the last five years.
Because this project is sensitive, we would like you to be an external consultant to help us analyze this data. Primarily, we would like to know if any of the factors we have access to can help us identify employees who are most likely to use a leave day. Secondarily, we would like to know how accurately the data can be used to predict whether or not an employee is likely to take a leave day during a given three-month period. As you might imagine, we want to avoid talking to employees to further investigate this issue unless we are very confident that they are likely to use the leave policy in three months. For that reason, before we use the model you develop to start investigating possible abuse of the policy, is there a way to tune the model so that at most 10% of the people who don’t use the leave policy are incorrectly predicted by the model to use the leave policy? If the model can be tuned that way, how good will the model be at accurately identifying people who actually will use the leave policy in three months if we do that?
Senior Analyst’s Objectives
1. Give a brief summary of the data using both numeric and visual summaries
2. Investigate the relationship between the predictors and the use of the leave policy to determine what model would be most appropriate
3. Evaluate and give a description of the utility and validity of the model
4. Analyze and interpret the relationship between the use of the leave policy and the predictors for the client
5. Be sure to comment on any issues or weaknesses the model may have so that the client understands the restrictions of the analysis
Summary of the Analysis
The main objective of the analysis is to identify whether there is a relationship between an employee taking a leave day and factors such as the employee’s salary, the employee’s age, whether the employee is full time or part time, and whether the employee got a raise or promotion in the last five years.
1.1 Brief summary of the data
Variable | Description of the variable
Took leave | Categorical variable describing whether the employee has taken leave. It takes the value ‘yesleave’ if the employee took leave and ‘noleave’ if not.
Salary | Numerical variable recording the employee’s salary. The minimum salary is 20004.5 and the maximum salary is 69997.
Employment Status | Categorical variable describing the employee’s employment status. It is part time when the employee works part time and full time when the employee works full time.
RaiseorPromo | Categorical variable describing whether the employee has received a raise or promotion during the last five years. It takes the value Yes if so and No if not.
➔ From the bar plot, we can see that around 800 employees took leave and around 200 employees did not.
➔ From the bar plot, we can see that around 780 employees work full time and around 220 work part time.
➔ From the bar plot, we can see that around 810 employees received no raise or promotion during the last five years, while around 200 did.
➔ From the box plot, we can see that the typical (median) salary of employees who took leave is about $45,000, while for those who did not it is around $50,000.
➔ From the box plot, we can see that the typical (median) age is about 35 both for employees who took leave and for those who did not.
1.2 Relationship between the predictors and the leave policy
From the correlation analysis, we can see that there is a positive relationship between the employee’s age and the employee’s salary.
For the categorical variables, the correlation coefficient cannot identify the type of relationship. Our initial expectation was that the probability of an employee taking leave increases with age and is higher for part-time employees.
1.3 Utility and validity of the model
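The model we set up to find the relationship between the use of the leave policy and the predictor variables is a logistic regression model. The dependent variable, the one we are trying to predict, is whether the employee took leave; salary, employment status, raise or promotion, and age are the independent variables.
This model is useful because it lets us identify which predictor variables have an impact on an employee’s leave. Its ability to separate the two groups is modest, however: the c statistic reported in the exhibit is 0.640, where 0.5 corresponds to chance and 1 to perfect separation.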
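The client also asks whether the model can be tuned so that at most 10% of the employees who do not use leave are wrongly flagged. One possible way to approach this, sketched here in Python, is to choose the lowest probability cutoff whose false positive rate stays at or below 10% and then report the share of actual leave takers that the cutoff catches. The function name `tune_threshold` and the scores and labels in the example are hypothetical illustrations, not values from the client's data.

```python
def tune_threshold(scores, labels, max_fpr=0.10):
    """Pick the lowest cutoff whose false positive rate is <= max_fpr.

    scores: predicted probabilities of taking leave
    labels: 1 if the employee actually took leave, 0 otherwise
    Returns (cutoff, false_positive_rate, true_positive_rate),
    or None if no cutoff satisfies the limit.
    """
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    positives = [s for s, y in zip(scores, labels) if y == 1]
    best = None
    # Lowering the cutoff can only raise the false positive rate,
    # so we walk from the highest score down and stop at the first violation.
    for cutoff in sorted(set(scores), reverse=True):
        fpr = sum(s >= cutoff for s in negatives) / len(negatives)
        tpr = sum(s >= cutoff for s in positives) / len(positives)
        if fpr > max_fpr:
            break
        best = (cutoff, fpr, tpr)
    return best


# Hypothetical scores: 10 employees without leave, 5 with leave.
toy_scores = [0.05, 0.10, 0.12, 0.15, 0.18, 0.20, 0.22, 0.25, 0.30, 0.45,
              0.20, 0.28, 0.35, 0.50, 0.60]
toy_labels = [0] * 10 + [1] * 5

result = tune_threshold(toy_scores, toy_labels, max_fpr=0.10)
print(result)
```

In practice the cutoff would be chosen on the fitted model's predicted probabilities, ideally on data not used to fit the model, and the resulting true positive rate answers the client's second question about how many actual leave takers would still be identified.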
1.4 The model interpretation
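According to the results of our model, full-time employees have higher estimated odds of taking leave than part-time employees (odds ratio 1.450), but the 95% confidence interval (0.771 to 2.726) includes 1 and the effect is not statistically significant (p = 0.2494).
In addition, when the employee has not received a raise or promotion during the last five years, the probability of taking leave increases: the estimated odds are 1.866 times those of employees who did receive one (95% confidence interval 1.210 to 2.879, p = 0.0048).
Furthermore, as the employee’s age or salary increases, the estimated probability of taking leave decreases slightly. The salary effect is statistically significant (p = 0.0001) even though its odds ratio rounds to 1.000 per dollar, while the age effect is not significant (p = 0.7164).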
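To make this interpretation concrete, the estimates listed in the exhibit can be plugged into the logistic formula to obtain a predicted probability for a given employee. The following Python sketch assumes, as the interpretation above does, that the modeled outcome is taking leave. The function name `predicted_probability` and the example employee are hypothetical, and because the salary coefficient is printed rounded to -0.00004, the results are approximate.

```python
import math

# Estimates copied from the maximum likelihood table in the exhibit.
# Reference levels (part time, raise/promotion = yes) have coefficient 0.
COEF = {
    "intercept": -0.0717,
    "full_time": 0.3712,   # FullTime vs PartTime
    "no_raise": 0.6241,    # raiseOrPromo no vs yes
    "salary": -0.00004,    # per dollar, rounded in the printed output
    "age": -0.00469,       # per year
}


def predicted_probability(salary, age, full_time, no_raise):
    """Logistic model prediction: p = 1 / (1 + exp(-linear predictor))."""
    logit = (COEF["intercept"]
             + COEF["full_time"] * (1 if full_time else 0)
             + COEF["no_raise"] * (1 if no_raise else 0)
             + COEF["salary"] * salary
             + COEF["age"] * age)
    return 1.0 / (1.0 + math.exp(-logit))


# Hypothetical employee: full time, age 35, salary 45,000.
p_without_raise = predicted_probability(45000, 35, full_time=True, no_raise=True)
p_with_raise = predicted_probability(45000, 35, full_time=True, no_raise=False)
print(round(p_without_raise, 3), round(p_with_raise, 3))
```

Comparing the two calls shows the direction of the raise or promotion effect while holding the other predictors fixed; changing `salary` in steps of 10,000 shows why a per-dollar odds ratio of 1.000 can still matter over the observed salary range.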
1.5 Problems of the model
The data may not be fully representative for assessing the leave policy, which limits what can be concluded about the relationship between leave use and the predictor variables.
In addition, the two outcome groups are very unequal in size (roughly a four-to-one split in the bar plot). With so few employees in the smaller group, their behavior is harder to model, and the model’s accuracy in assessing the leave policy is harder to measure.
II. Statistical Appendix
Visual summary of the data
To visualize the data, we used bar plots and box plots.
Bar plots and box plots serve different purposes:
a) Bar plots are used for categorical variables, that is, variables that have two or more categories. For instance, the leave variable has two categories: yes when the employee has taken leave and no when the employee has not.
Bar plots show graphically the relationship between a categorical variable and a numeric value. Each category is represented as a bar, and the size of the bar represents the numeric value (here, the number of employees in that category).
b) Box plots are used for continuous variables, that is, variables that can take any numerical value in a range. For instance, the salary of the employee is a continuous variable.
Box plots depict groups of numerical data graphically through their quartiles.
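As an illustration of the quartiles a box plot is built from, the following Python sketch computes them with the standard library. The salary values are made up for the example and are not taken from the client's data; the function name `box_plot_summary` is likewise hypothetical.

```python
import statistics


def box_plot_summary(values):
    """Return the five numbers a basic box plot is drawn from."""
    # statistics.quantiles with n=4 returns the three quartile cut points.
    q1, median, q3 = statistics.quantiles(values, n=4)
    return {
        "min": min(values),
        "q1": q1,
        "median": median,
        "q3": q3,
        "max": max(values),
        "iqr": q3 - q1,  # interquartile range: the height of the box
    }


# Hypothetical salaries for one group of employees.
toy_salaries = [22000, 31000, 38000, 45000, 52000, 60000, 68000]
summary = box_plot_summary(toy_salaries)
print(summary)
```

Computing this summary separately for employees who took leave and those who did not gives the numeric counterpart of the side-by-side box plots.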
Relationship between the predictors and the leave policy
To analyze the relationship between the predictors and the leave policy, we used the correlation coefficient.
The correlation coefficient is a numerical measure of the strength of the linear relationship between two variables. It takes values between -1 and 1; the closer it is to -1 or 1, the stronger the relationship between the variables, and values near 0 indicate little or no linear relationship.
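A minimal Python sketch of the (Pearson) correlation coefficient is given here for reference. The function name `pearson_r` and the age and salary values are hypothetical and chosen only to show the two extreme cases.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation: covariance divided by the product of the spreads."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical data: salary rises exactly in step with age.
toy_ages = [25, 30, 35, 40, 45]
toy_salaries = [30000, 35000, 40000, 45000, 50000]

r_positive = pearson_r(toy_ages, toy_salaries)        # perfectly increasing
r_negative = pearson_r(toy_ages, toy_salaries[::-1])  # perfectly decreasing
print(r_positive, r_negative)
```

The same function can be applied to a two-level variable coded as 0 and 1, although for the categorical predictors the logistic regression results are the more informative summary.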
III. Exhibit
Association of Predicted Probabilities and Observed Responses
Percent Concordant | 64.0 | Somers' D | 0.279
Percent Discordant | 36.0 | Gamma | 0.279
Percent Tied | 0.0 | Tau-a | 0.089
Pairs | 169744 | c | 0.640
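The association statistics above are based on comparing every pair made of one employee with the event and one without it. The following Python sketch shows one way these quantities can be computed; the function name `concordance` and the toy scores are hypothetical, and statistical packages may group predicted probabilities before comparing them, so this is an illustration of the definitions rather than a reproduction of the output.

```python
def concordance(scores, labels):
    """Pairwise concordance between predicted scores and observed outcomes."""
    events = [s for s, y in zip(scores, labels) if y == 1]
    nonevents = [s for s, y in zip(scores, labels) if y == 0]
    concordant = discordant = tied = 0
    for e in events:
        for n in nonevents:
            if e > n:
                concordant += 1   # event scored higher: pair agrees with model
            elif e < n:
                discordant += 1   # event scored lower: pair disagrees
            else:
                tied += 1
    pairs = len(events) * len(nonevents)
    return {
        "pairs": pairs,
        "percent_concordant": 100.0 * concordant / pairs,
        "percent_discordant": 100.0 * discordant / pairs,
        "percent_tied": 100.0 * tied / pairs,
        "somers_d": (concordant - discordant) / pairs,
        "c": (concordant + 0.5 * tied) / pairs,
    }


# Hypothetical scores: 3 events and 4 non-events.
toy_scores = [0.8, 0.6, 0.4, 0.7, 0.5, 0.4, 0.2]
toy_labels = [1, 1, 1, 0, 0, 0, 0]
stats = concordance(toy_scores, toy_labels)
print(stats)
```

Under these definitions Somers' D equals 2c - 1, which agrees with the reported values up to rounding (2 × 0.640 - 1 = 0.280 against 0.279). The number of pairs is the product of the two group sizes; groups of 206 and 824, for example, give exactly 169,744 pairs, which would be consistent with the roughly 200 and 800 split seen in the bar plot.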
Odds Ratio Estimates
Effect | Point Estimate | 95% Wald Lower Limit | 95% Wald Upper Limit
employmentStatus FullTime vs PartTime | 1.450 | 0.771 | 2.726
raiseOrPromo no vs yes | 1.866 | 1.210 | 2.879
salary | 1.000 | 1.000 | 1.000
age | 0.995 | 0.970 | 1.021
Type 3 Analysis of Effects
Effect | DF | Wald Chi-Square | Pr > ChiSq
employmentStatus | 1 | 1.3269 | 0.2494
raiseOrPromo | 1 | 7.9647 | 0.0048
salary | 1 | 14.7770 | 0.0001
age | 1 | 0.1320 | 0.7164
Analysis of Maximum Likelihood Estimates
Parameter | Level | DF | Estimate | Standard Error | Wald Chi-Square | Pr > ChiSq
Intercept | | 1 | -0.0717 | 0.5912 | 0.0147 | 0.9035
employmentStatus | FullTime | 1 | 0.3712 | 0.3223 | 1.3269 | 0.2494
employmentStatus | PartTime | 0 | 0 | . | . | .
raiseOrPromo | no | 1 | 0.6241 | 0.2211 | 7.9647 | 0.0048
raiseOrPromo | yes | 0 | 0 | . | . | .
salary | | 1 | -0.00004 | 0.000011 | 14.7770 | 0.0001
age | | 1 | -0.00469 | 0.0129 | 0.1320 | 0.7164
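The odds ratios, Wald confidence limits, chi-square statistics and p-values in this exhibit all follow from each estimate and its standard error. The following Python sketch reproduces them approximately from the printed values; the function name `wald_summary` is hypothetical, and small differences from the exhibit are expected because the printed estimates are rounded.

```python
import math


def wald_summary(estimate, std_error, z=1.959964):
    """Odds ratio, 95% Wald limits, Wald chi-square and p-value (1 df)."""
    chi_square = (estimate / std_error) ** 2
    # Upper tail of a chi-square with 1 degree of freedom,
    # written with the complementary error function.
    p_value = math.erfc(math.sqrt(chi_square / 2.0))
    return {
        "odds_ratio": math.exp(estimate),
        "lower": math.exp(estimate - z * std_error),
        "upper": math.exp(estimate + z * std_error),
        "chi_square": chi_square,
        "p_value": p_value,
    }


# Estimate and standard error for raiseOrPromo (no vs yes) from the exhibit.
raise_effect = wald_summary(0.6241, 0.2211)
# Estimate and standard error for employmentStatus (FullTime vs PartTime).
status_effect = wald_summary(0.3712, 0.3223)
print(raise_effect)
print(status_effect)
```

For salary, the same function applied to a 10,000 change (estimate and standard error both multiplied by 10,000) gives an odds ratio on a more readable scale than the per-dollar value of 1.000, although the rounding of the printed coefficient limits its precision.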