STAC32 Assignment 3

You are expected to complete this assignment on your own: that is, you may discuss general ideas with others, but the writeup of the work must be entirely your own. If your assignment is unreasonably similar to that of another student, you can expect to be asked to explain yourself.

If you run into problems on this assignment, it is up to you to figure out what to do. The only exception is if it is impossible for you to complete this assignment, for example a data file cannot be read. (There is a difference between you not knowing how to do something, which you have to figure out, and something being impossible, which you are allowed to contact me about.)

You must hand in a rendered document that shows your code, the output that the code produces, and your answers to the questions. This should be a file with .html on the end of its name. There is no credit for handing in your unrendered document (ending in .qmd), because the grader cannot then see whether the code in it runs properly. After you have handed in your file, you should be able to see (in Attempts) what file you handed in, and you should make a habit of checking that you did indeed hand in what you intended to, and that it displays as you expect.

Hint: render your document frequently, and solve any problems as they come up, rather than trying to do so at the end (when you may be close to the due date). If your document will not successfully render, it is because of an error in your code that you will have to find and fix. The error message will tell you where the problem is, but it is up to you to sort out what the problem is.

1 Why do people leave a company?

At the end of 2018, all the employees of a large company completed an anonymous survey. The company had nearly four thousand employees at that time. The results of the survey, in the file http://ritsokiguess.site/datafiles/survey-responses.csv, contain the following variables:

  • gender: the gender the employee identifies as
  • department: the department in the company where they work
  • sentiment: whether they are happy in their job (1 = strongly disagree to 10 = strongly agree)
  • intention: whether they intend to find a new job (1 = strongly disagree to 10 = strongly agree)
  • iid: an employee ID code (to protect the employee’s identity, but the same code refers to the same employee wherever it appears)
  • level: a coarser subdivision of the sentiment scale into low, medium, and high (we ignore this column).

The employee survey data is in http://ritsokiguess.site/datafiles/survey-responses.csv.

Some of the employees, in fact over a thousand of them, left the company within the two years after the survey. A second data file shows which employees left, and the date they left the company. This has two columns: the same iid as in the employee survey data (same iid implies same employee), and the date the employee left (as year-month-day). This data file is in http://ritsokiguess.site/datafiles/departure-dates.csv.

(a) (1 point) Read in and display some of the employee survey data.

As usual:

my_url1 <- "http://ritsokiguess.site/datafiles/survey-responses.csv"
survey <- read_csv(my_url1)
Rows: 3770 Columns: 6
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (4): iid, gender, department, level
dbl (2): sentiment, intention

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
survey

There are 3,770 rows (“nearly four thousand”), along with all the variables listed. (This part and the next one are worth only one point each, so saying this is optional here and in the next part, but you should note it for yourself.)

(b) (1 point) Read in and display some of the employee departure data.

Giving you an easy ride so far:

my_url2 <- "http://ritsokiguess.site/datafiles/departure-dates.csv"
departures <- read_csv(my_url2)
Rows: 1354 Columns: 2
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (1): iid
date (1): departure_date

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
departures

1,354 rows (“over a thousand”) and the two columns promised. Note that departure_date got read in as a proper R date, because it was written as year-month-day.

If you are really keen, you can note that the first employee in the departure dataframe is also the first employee in the survey dataframe.
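
If you want to check that for yourself (entirely optional), something like the following sketch would do it, showing the employee ID in the first row of each dataframe so that you can compare them by eye:

# first row of each dataframe, employee ID only
survey %>% slice(1) %>% select(iid)
departures %>% slice(1) %>% select(iid)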

(c) (4 points) Create a combined dataframe that contains, for each employee, all the information in the survey dataframe, plus their departure date (if they left the company). What value does your combined dataframe have for the departure date of employees who did not leave the company within the next two years?

This is looking up the employees in one dataframe that are also in a second dataframe, so left_join is what we need. Note that the only column that is in both dataframes is iid and it is called the same thing in both places, so left_join will use this to match the employees (and thus no join_by at all is needed):

survey %>% left_join(departures) -> employees
Joining with `by = join_by(iid)`
employees

The greatest understanding is shown by saying why you didn’t need a join_by in your left_join. The second-best answer explicitly names the column that the two dataframes have in common:

survey %>% left_join(departures, join_by("iid"))

You can get yourself full points by doing it this way and explaining why you chose to do so, for example by saying that you wanted to match employee ID in the two dataframes, and you wanted to be clear about what you were doing.

Three points available for the above. The last point is for saying that the combined dataframe has a lot of missings in the departure date column, and that these are for the employees that didn’t leave the company (were still working there at the end of 2020). To justify this, you can go a couple of ways: you can use what you know about left_join to say that if the employee is not found in the departures table, their departure date will be missing. Or you can count up how many missings there are; the easiest way to do this is with summary:

summary(employees)
     iid               gender           department           level          
 Length:3770        Length:3770        Length:3770        Length:3770       
 Class :character   Class :character   Class :character   Class :character  
 Mode  :character   Mode  :character   Mode  :character   Mode  :character  
                                                                            
                                                                            
                                                                            
                                                                            
   sentiment        intention      departure_date      
 Min.   : 1.000   Min.   : 1.000   Min.   :2019-01-31  
 1st Qu.: 7.000   1st Qu.: 3.000   1st Qu.:2019-08-31  
 Median : 8.000   Median : 4.000   Median :2020-01-31  
 Mean   : 7.586   Mean   : 4.193   Mean   :2020-02-12  
 3rd Qu.: 9.000   3rd Qu.: 6.000   3rd Qu.:2020-06-30  
 Max.   :10.000   Max.   :10.000   Max.   :2020-12-31  
                                   NA's   :2416        

and then think a bit: there were 3770 employees, of which 1354 left, so this many must still be at the company:

3770 - 1354
[1] 2416

and that’s how many missing values there are in departure_date.
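
If you prefer to count the missing departure dates directly with tidyverse tools, a sketch like this (using the combined dataframe made above) should also show 2416 missing values:

# TRUE means the departure date is missing: the employee is still there
employees %>% count(is.na(departure_date))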

(d) (3 points) Make a suitable plot that compares the employees’ intention of leaving between the employees that actually left and the ones that stayed. Hint: is.na returns TRUE if its input is missing and FALSE if it is not missing. You might like to create a new column that will help you make your plot.

The obvious thing to apply is.na to is the departure dates:

employees %>% mutate(is_missing = is.na(departure_date))

You might need to scroll across to check that the new column is TRUE if the departure date is missing (that is, the employee has not left the company) and FALSE if the departure date is not missing (that is, the employee did leave sometime). One point for getting this far.

Now you can use your new column and intention to make a boxplot:

employees %>% mutate(is_missing = is.na(departure_date)) %>% 
  ggplot(aes(x = is_missing, y = intention)) + geom_boxplot()

Remember that when ggplot is on the end of a pipeline, the dataframe that is used is the one that came out of the previous step of the pipeline. (You also have the option of saving the dataframe with is_missing in it, and using that saved dataframe as the first input to ggplot.) Two more points for the graph.
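
For the record, here is a sketch of that save-it-first alternative (the name d2 is just one I made up):

# save the dataframe with the new column, then hand it to ggplot
employees %>% mutate(is_missing = is.na(departure_date)) -> d2
ggplot(d2, aes(x = is_missing, y = intention)) + geom_boxplot()

Either way, the graph comes out the same.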

(e) (2 points) What do you learn from your plot? Does this make sense in the context of the data? Explain briefly.

The median intention (to leave) is higher when is_missing is FALSE: that is to say, if the employee actually did leave the company, this showed up in the survey as their intention to leave being higher on average. This makes sense; leaving a job is something that you plan for (or at least think about) in advance, so you would expect employees that ended up leaving to have a higher intention to leave before they actually did.

Extra 1: you may have found that the logic to your boxplot was difficult to untangle, because you had to stop and figure out what “true” and “false” actually meant. To make a nicer plot, we can use is.na combined with ifelse. ifelse is like =IF on a spreadsheet: it takes three inputs. The first one is something that is either true or false; the second one is the value to use if the first thing is true, and the third one is the value to use if it is false. We can use this idea to make our new variable have better names, like this:

employees %>% mutate(status = ifelse(is.na(departure_date), "still there", "left"))

Now our new variable has values that tell us what they mean.1
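
In case the three inputs to ifelse are still not clear, here is a tiny demonstration on a made-up logical vector (the values are chosen only for illustration):

# the second input is used where TRUE, the third where FALSE
ifelse(c(TRUE, FALSE, TRUE), "still there", "left")

which returns “still there”, “left”, “still there”.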

Then the boxplot is easier to interpret:

employees %>% mutate(status = ifelse(is.na(departure_date), "still there", "left")) %>% 
  ggplot(aes(x = status, y = intention)) + geom_boxplot()

The people that left had a higher intention to leave when they did the survey, at least on average.
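
If you would rather see numbers than boxplots, a sketch of a group summary like this one would back that up (I don’t show the output here; the group that left should have the higher median):

employees %>% 
  mutate(status = ifelse(is.na(departure_date), "still there", "left")) %>% 
  group_by(status) %>% 
  summarize(med_intention = median(intention))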

Extra 2: the logic to a boxplot is “if x, then y”. Here, if an employee left, then they had a higher intention of leaving. That may feel backwards to you: an employee leaving or not seems more like the outcome, and their score on a survey before they left (or did not leave) seems more explanatory. To say something about things that way around takes something like logistic regression, which you’ll learn about in D29. A boxplot is really designed for the situation where the categorical thing is explanatory, like treatments in a designed experiment, and the quantitative variable is a response.

The appropriate picture is (well) beyond our scope at the moment, but it’s a nice picture, so I’ll make it for you. First we fit a logistic regression. I have now decided to save my dataframe with is_missing in it (so I’m now thinking that I should have done this in the first place):

employees %>% mutate(is_missing = is.na(departure_date)) -> d
employees.1 <- glm(as.numeric(is_missing) ~ intention, data = d, family = "binomial")
summary(employees.1)

Call:
glm(formula = as.numeric(is_missing) ~ intention, family = "binomial", 
    data = d)

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)  1.69179    0.08298   20.39   <2e-16 ***
intention   -0.25756    0.01703  -15.13   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 4923.1  on 3769  degrees of freedom
Residual deviance: 4680.2  on 3768  degrees of freedom
AIC: 4684.2

Number of Fisher Scoring iterations: 4

This shows a negative (and strongly significant) slope, so an employee with a higher intention to leave has a lower probability of still being with the company after two years (or, a higher probability of leaving during that time).
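
As an aside (not something you needed to do), you can exponentiate the slope to turn it into an odds ratio:

# odds ratio for a one-point increase in intention
exp(coef(employees.1)["intention"])

This comes out to about 0.77: each extra point of intention to leave multiplies the odds of still being at the company two years later by about 0.77.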

Now I can use the marginaleffects package to draw a nice picture:

library(marginaleffects)
plot_predictions(model = employees.1, condition = "intention")

As intention to leave goes up, the probability of still being at the company goes down. The grey area around the predictions is a confidence interval; you see that the interval is short, because there are so many employees.

2 Escaping from an oil rig

People who work in dangerous environments, like an offshore oil rig, need to be able to escape to safety quickly. Every so often, these people are required to take part in a simulated escape exercise (as training for a real escape, should it ever be necessary). In one such exercise, a sample of workers on an offshore oil rig were timed to see how fast they could escape to safety. The data are in http://ritsokiguess.site/datafiles/ex01.36.csv.

(a) (2 points) Read in and display some of the data. How many observations are there?

my_url <- "http://ritsokiguess.site/datafiles/ex01.36.csv"
escape <- read_csv(my_url)
Rows: 26 Columns: 1
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
dbl (1): escape_time

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
escape

There are 26 observations (rows) in the dataframe (the second point; the first is for reading the file properly).

There is only one column, so in fact you could use any of the read_ things to read in the file (they differ in terms of what is between the data values, but there is no “between” here).
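
For example, here is a sketch with read_tsv, which looks for tab characters between values; there are no tabs here, so each line is read as the single value it contains (escape2 is just a name I made up):

# same file, read with a different read_ function
escape2 <- read_tsv(my_url)

This should give you the same dataframe as before.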

(b) (2 points) Make a suitable plot of the escape times. From your plot, how will the mean and median compare? Explain briefly.

One quantitative variable, so a histogram:

ggplot(escape, aes(x = escape_time)) + geom_histogram(bins = 6)

Don’t use substantially more bins than this, with only 26 observations.

The distribution is apparently close to symmetric, so the mean should be about the same as the median. Your graph (and therefore your conclusion) may depend on the number of bins you chose.

If you want to check, which you can do for yourself, you know how to do that:

escape %>% 
  summarize(mean_time = mean(escape_time), 
            med_time = median(escape_time))

The mean is slightly bigger, but there is not much difference, given the variability in the escape times.

An alternative plot is a one-sample boxplot:

ggplot(escape, aes(x = 1, y = escape_time)) + geom_boxplot()

This distribution is pretty much symmetric. If this is the plot you drew, you don’t really have any justification for saying that the mean and median will be different. (Note that any asymmetry you might have seen on your histogram, depending on how many bins you had, does not show up here.)

Have a conclusion about the mean and median, and have a (good) reason for drawing that conclusion.

(c) (2 points) Why can we expect a t-procedure to be reasonably accurate for these data? Explain briefly.

There are two things to consider:

  • whatever you said about the histogram or boxplot before: is it normal in shape, or at least symmetric?

  • the sample size of \(n = 26\) is likely to be large enough to overcome whatever non-normality you thought there was.

Hence, a t-test or t confidence interval should be perfectly good.

Extra: you also have the option of obtaining a bootstrap sampling distribution of the sample mean, and assessing that for normality. Your standards for normality should be higher here:

tibble(sim = 1:1000) %>% 
  rowwise() %>% 
  mutate(my_sample = list(sample(escape$escape_time, replace = TRUE))) %>% 
  mutate(my_mean = mean(my_sample)) %>% 
  ggplot(aes(x = my_mean)) + geom_histogram(bins = 10)

That is as bell-shaped as you could reasonably wish for, and thus our t-procedure should be fine.
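
If you want a sharper check of that, a normal quantile plot of the bootstrapped sample means is an option (this re-runs the simulation, so your random values will differ a little from mine):

tibble(sim = 1:1000) %>% 
  rowwise() %>% 
  mutate(my_sample = list(sample(escape$escape_time, replace = TRUE))) %>% 
  mutate(my_mean = mean(my_sample)) %>% 
  ggplot(aes(sample = my_mean)) + stat_qq() + stat_qq_line()

The points should follow the line closely if the sampling distribution of the sample mean is close to normal.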

(d) (3 points) A standard escape time for this kind of exercise is 6 minutes (360 seconds). Is there evidence that the mean escape time for all oil workers is greater than this? Your answer should be convincing to your reader.

This calls for a hypothesis test (“evidence”), and the test needs to be one-sided because we are looking for evidence that the mean is greater:

with(escape, t.test(escape_time, mu = 360, alternative = "greater"))

    One Sample t-test

data:  escape_time
t = 2.2382, df = 25, p-value = 0.01718
alternative hypothesis: true mean is greater than 360
95 percent confidence interval:
 362.5323      Inf
sample estimates:
mean of x 
 370.6923 

Go through the procedure:

  • the null hypothesis is that the population mean is 360 (seconds), with the alternative being that it is greater than 360.

  • the P-value is 0.017, which is less than 0.05 (or compare it to the \(\alpha\) that you chose, if different), so we reject the null hypothesis in favour of the alternative.

  • therefore we conclude that there is evidence that the mean escape time for all oil workers (of which these are a sample) is greater than 360.

To be convincing, your answer needs to state the hypotheses being tested, which one of them you are going with and why, and what that means in terms of escape times of oil workers. You need the last one because somebody needs to use your test result to make a decision; in this case, that might be to give oil workers extra escape training so that in the event of a real emergency, they will be able to escape quickly enough.

The point of a hypothesis test is to draw a conclusion about a population, in this case all oil workers, based on a sample from that population. Drawing a conclusion about this sample is a “well, duh” moment: we know exactly what the sample mean is, but that is only of value to us in so far as it enables us to say something about the population from which the sample came.

See the Extra to the next part for another acceptable way of doing this.

(e) (2 points) Obtain a 90% confidence interval for the population mean escape time.

Confidence intervals are two-sided things, so re-do your t.test without the one-sided alternative, and with the right confidence level (since this one is not the default 95%):

with(escape, t.test(escape_time, conf.level = 0.90))

    One Sample t-test

data:  escape_time
t = 77.598, df = 25, p-value < 2.2e-16
alternative hypothesis: true mean is not equal to 0
90 percent confidence interval:
 362.5323 378.8523
sample estimates:
mean of x 
 370.6923 

A 90% confidence interval for the population mean escape time is 362.5 to 378.9 seconds. You need to state the confidence interval, and round it off to a suitable number of decimal places, without being explicitly asked to.2 Remember your audience: some manager of oil workers wants to know how much slower the workers were on average than the target time (somewhere between about 0 and 20 seconds slower), because the manager needs to decide on the level of extra training needed; or maybe they conclude that being up to 20 seconds slower, rather than, say, a minute slower, is acceptable. The manager does not need to be hunting in the output for an answer, and they do not need it to four decimals. They need something they can work with, and they need it now.

To think about how many decimals to give: go back to the data, which was measured to a whole number of seconds. With typical sample sizes, such as the one we have here, you can give one more decimal place than the data. Any more than that is just noise, and has no value to anyone. You might even round the ends of the confidence interval off to the nearest whole number (363 to 379 seconds), on the basis that this is likely to be as accurate as any consumer will want. If this had been a physics experiment,3 where high accuracy is expected, it would be a different story, but as it is, high accuracy is not justified.
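
If you would rather have R do the rounding for you, here is a sketch, saving the t.test output first (t_ci is a name of my choosing):

# conf.int holds the two ends of the interval
t_ci <- with(escape, t.test(escape_time, conf.level = 0.90))
round(t_ci$conf.int, 1)

which should give 362.5 and 378.9 (along with a note of the confidence level).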

Extra: in both of the previous two parts, an alternative way is to use the dollar sign to name the column you want from the dataframe you want:

 t.test(escape$escape_time, mu = 360, alternative = "greater")

    One Sample t-test

data:  escape$escape_time
t = 2.2382, df = 25, p-value = 0.01718
alternative hypothesis: true mean is greater than 360
95 percent confidence interval:
 362.5323      Inf
sample estimates:
mean of x 
 370.6923 

and

 t.test(escape$escape_time, conf.level = 0.90)

    One Sample t-test

data:  escape$escape_time
t = 77.598, df = 25, p-value < 2.2e-16
alternative hypothesis: true mean is not equal to 0
90 percent confidence interval:
 362.5323 378.8523
sample estimates:
mean of x 
 370.6923 

with identical results. There isn’t really a good reason to prefer my with approach to this one for a one-sample procedure (they are about the same amount of typing), but as soon as you find yourself typing a dollar sign twice in the same line of code, you ought to think about how you can avoid that.
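
To illustrate what I mean, here is an artificial calculation, chosen only to show the syntax rather than to say anything useful about escape times:

# two dollar signs in one line: a hint to reach for with()
mean(escape$escape_time) - sd(escape$escape_time)
# the same calculation without the repetition
with(escape, mean(escape_time) - sd(escape_time))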

Footnotes

  1. You might have to think a bit to get the “left” and “still there” the right way around, but the thinking is confined to this bit.↩︎

  2. Likewise, when you do a test, you should say what the P-value was, rounded off enough but not too much (so that you can still see whether it was less than 0.05, but without burdening your reader with extra decimal places). In this case, you could round the P-value off to 0.02 and still see that it was less than 0.05. I like having one more decimal place so that my reader has a decent sense of how much less than 0.05 it was, but citing the P-value as 0.01718 is definitely too much.↩︎

  3. In a physics experiment, the data might have been recorded to four decimals and they might not vary very much, so you would then be justified in giving (say) five decimals on your confidence interval. In that sort of case the answer might be 1.00027 to 1.00032, and then you really need those five decimals.↩︎