STAD29 Assignment 2

You are expected to complete this assignment on your own: that is, you may discuss general ideas with others, but the writeup of the work must be entirely your own. If your assignment is unreasonably similar to that of another student, you can expect to be asked to explain yourself.

If you run into problems on this assignment, it is up to you to figure out what to do. The only exception is if it is impossible for you to complete this assignment, for example if a data file cannot be read. (There is a difference between you not knowing how to do something, which you have to figure out, and something being impossible, which you are allowed to contact me about.)

You must hand in a rendered document that shows your code, the output that the code produces, and your answers to the questions. This should be a file with .html on the end of its name. There is no credit for handing in your unrendered document (ending in .qmd), because the grader cannot then see whether the code in it runs properly. After you have handed in your file, you should be able to see (in Attempts) what file you handed in, and you should make a habit of checking that you did indeed hand in what you intended to, and that it displays as you expect.

Hint: render your document frequently, and solve any problems as they come up, rather than trying to do so at the end (when you may be close to the due date). If your document will not successfully render, it is because of an error in your code that you will have to find and fix. The error message will tell you where the problem is, but it is up to you to sort out what the problem is.

Packages

library(tidyverse)
library(marginaleffects)

Management consultants

A large management consulting company investigated the relationship between amount of job experience (months) for a junior consultant and the likelihood of the consultant being able to perform a certain complex task (the same task for each junior consultant). The data are in http://ritsokiguess.site/datafiles/ex_13-25.csv. The column completion indicates whether the junior consultant was a Success or a Failure at the task.

  1. (1 point) Read in and display (some of) the data.

As expected:

my_url <- "http://ritsokiguess.site/datafiles/ex_13-25.csv"
task <- read_csv(my_url)
Rows: 28 Columns: 2
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (1): completion
dbl (1): experience

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
task

Give the dataframe a suitable name of your choosing. There were 28 junior consultants in the investigation altogether.

  2. (3 points) Make a suitable graph of the data, interpret the graph, and explain briefly why the graph does not tell you what you really want to know (or does tell you, if you think it does).

One point each for the three things:

First, the graph is (evidently) a boxplot:

ggplot(task, aes(x = completion, y = experience)) + geom_boxplot()

Second, this indicates that junior consultants who succeeded at the task tended to have more experience than the ones who didn’t. (Careful with the wording here: see below.)

Third, on a boxplot, the categorical variable is explanatory, so this plot says: if a junior consultant succeeded at the task, then they tended to have more experience than one who didn’t succeed. But in this context, success or failure is the response (outcome), which presumably depends on the amount of experience (explanatory). Hence, the boxplot gets explanatory and response the wrong way around. What we really want to know is “does having more experience make it more likely that the junior consultant will succeed at the task?”, and that is not what the boxplot is telling us.

For the third point, say something about explanatory and response being the wrong way around (or the cause and effect being reversed) and why.

Extra: you might instead make a plot like this, with experience on the x-axis:

ggplot(task, aes(x = experience, y = completion)) + geom_point()

Now the cause and effect is the right way around: if the junior consultant has little experience, they are more likely to be in the bottom row of points (failure), but if they have a lot, they are more likely to be in the top row (success). If you came up with a graph like this, your third point comes from an explanation of why the cause and effect is the right way around for your graph.

I don’t think flipping the axes on the boxplot achieves this, because it is still looking at the distribution of experience given completion, which is not what we want here.
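
For the record, flipping the axes on the boxplot would look something like this (a sketch, not a recommendation; it has the same problem):

# same boxplots, just drawn sideways: still the distribution of experience within each completion group
ggplot(task, aes(x = experience, y = completion)) + geom_boxplot()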

  3. (2 points) Fit a suitable logistic regression, and display the results. (Hint: what is your response variable? Are its values what they need to be?)

Taking the hint first: the response variable is completion, which needs to be either 0-or-1 or a factor. It is neither of those (it is text: see the chr at the top of the column when you read it in), so make it a factor in the model:

task.1 <- glm(factor(completion) ~ experience, data = task, family = "binomial")
summary(task.1)

Call:
glm(formula = factor(completion) ~ experience, family = "binomial", 
    data = task)

Coefficients:
            Estimate Std. Error z value Pr(>|z|)  
(Intercept) -3.02597    1.25190  -2.417   0.0156 *
experience   0.17303    0.06773   2.555   0.0106 *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 38.816  on 27  degrees of freedom
Residual deviance: 28.950  on 26  degrees of freedom
AIC: 32.95

Number of Fisher Scoring iterations: 4

If you forget the factor thing, this will happen:

task.0 <- glm(completion ~ experience, data = task, family = "binomial")
Error in eval(family$initialize): y values must be 0 <= y <= 1

This is not the most helpful error message, because it doesn’t tell you that the response variable can also be a factor, but it does tell you that something is up with the response variable, and you can go back to your lecture notes to find out how you can fix it.

  4. (2 points) How does your output tell you that more experience goes with a greater probability of success at the task? There are two things to say, neither of which has anything (here) to do with statistical significance.

The second sentence rules out discussion of the (small) P-value. No points for that. One point each for the two things below.

The first thing to do is to figure out what you are modelling the probability of. The response variable is categorical, with two levels Failure and Success; the first of these is the baseline, so we are modelling the probability of the other one. That is, we are modelling the probability of Success.
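
If you want to check which level is the baseline, one way (not required) is to look at the levels of the factor, which R puts in alphabetical order by default:

# Failure comes first alphabetically, so it is the baseline level
levels(factor(task$completion))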

The second thing is that because the Estimate for experience is positive (0.173), we can now say that as experience increases, the probability of success at the task increases.

Note that if you don’t say the first thing, you don’t have a proper justification of the second thing: for all you know, you might be modelling the probability of failure, unless you rule that out first.

Extra: the statistical significance of the slope (P-value 0.011) is of course important; this is how we know that the amount of experience really does have something to do with success at the task, and therefore that inferring that greater experience goes with a greater probability of success is a reasonable thing to do. But this question is not asking about that.

  5. (3 points) Using a suitable package, make a plot of success probability as it depends on experience, including a (95%) confidence interval for success probability for each amount of experience. How does your plot suggest that there is a significant relationship between experience and success at the task?

This is exactly what plot_predictions from the marginaleffects package does. The first input is the fitted model, and the condition input is the explanatory variable (only one, here):

plot_predictions(task.1, condition = "experience")

Two points for that. (This is much easier than trying to build it yourself; you may have seen a case when we had no alternative but to do that.)

What you want for the last point is something that says that the probability of success is really changing as experience increases. I think the best way to do that is not to look at the predictions themselves (they are going up for the same reason as we saw in the previous question), but to look at the confidence intervals. When experience is small, the probability of completing the task is estimated to be less than 0.5 (because the confidence interval does not go up that far for less than about 10 months of experience), but when experience is large, the probability is estimated to be greater than 0.5 (for greater than about 25 months of experience). Another way to say this is that the confidence intervals for small experience and large experience definitely do not overlap.
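
If you wanted numbers to back up what the plot shows, a sketch (not required) is to get predictions with confidence limits at a few chosen experience values, using datagrid and predictions in the same way as we do for the daphnids later. The experience values here are just ones I picked, and new_exp is my name:

# predicted success probabilities with confidence limits at 5, 15, and 30 months
new_exp <- datagrid(model = task.1, experience = c(5, 15, 30))
cbind(predictions(model = task.1, newdata = new_exp)) %>% 
  select(experience, estimate, conf.low, conf.high)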

Extra 1: there are 28 individuals in this dataset, which is not very many for a logistic regression (each junior consultant only succeeds or fails at the task, rather than, say, being given a mark out of 100, so each individual doesn’t provide much information). Hence the confidence intervals are rather wide. For example, a junior consultant with 30 months of experience is estimated to have a probability of succeeding at the task between about 0.6 and 0.95, which is not exactly nailing down the probability very precisely. Having said that, though, the effect of experience is large enough to be significant even for this small sample size: we are in the usual position of knowing that there is an effect, but not knowing very much about how big the effect is.

Extra 2: you might be confused by the estimated probabilities (the black line) not being in the middle of their intervals, particularly out near the extremes. We have so far seen confidence intervals that are estimate plus or minus something times their standard error. What is happening here is that the intervals are on the scale of log-odds, and there the estimated log-odds really is in the middle of the interval, but the estimates and confidence interval endpoints are then turned back into probabilities, and then this is no longer true. For example, the predicted success probability for a junior consultant with 40 months of experience is close to 1, but the interval cannot go any higher than 1, so the interval in probability terms goes down further than it goes up from the estimate.
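
If you want to see the mechanics, here is a sketch using base R's predict (my choice of tool, not something you needed to do): the prediction and its standard error come out on the log-odds (link) scale, the interval is built there, and plogis turns the endpoints back into probabilities:

# prediction for 40 months of experience, on the log-odds scale by default, with its SE
p <- predict(task.1, newdata = data.frame(experience = 40), se.fit = TRUE)
plogis(p$fit)                              # estimated probability, close to 1
plogis(p$fit + c(-1.96, 1.96) * p$se.fit)  # approximate 95% interval, not symmetric about the estimate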

Daphnids

Daphnids are small crustaceans sometimes known as “water fleas”. Daphnids were exposed to one of several different concentrations of a toxic substance (in column dose of our data) for one of two times, 24 or 48 hours. Twenty daphnids were exposed to each combination of concentration and time. At the end of the time period, each daphnid was classified as “mobile” or “immobile”. The column no denotes the number of immobile daphnids. The data are in http://ritsokiguess.site/datafiles/drc_daphnids.csv.

  6. (1 point) Read in and display (some of) the data.

The usual gimme first point:

my_url <- 'http://ritsokiguess.site/datafiles/drc_daphnids.csv'
daphnids <- read_csv(my_url)  
Rows: 16 Columns: 4
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (1): time
dbl (3): dose, no, total

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
daphnids

Extra: the word “immobile” in this context seems to mean “apparently dead”, as in “not moving”. In other words, this is the same kind of study that investigates doses of poison required to kill rats. It is, as you see, a designed experiment, with the values of dose and time chosen by the experimenter. You might guess this, because for both values of time, the number in the no column increases as the dose increases.

  7. (2 points) Does each row of your dataframe contain one observation, or more than one? Explain briefly.

In the context of this study, one observation is one daphnid, which either is mobile or immobile at the end. But our data contain summaries of the number out of all 20 daphnids at that dose and time, with a count of how many fall into each response category. So each row of the dataframe contains 20 observations (the number in the total column), not just one.

The clue that there is more than one observation per row is usually that you have frequencies for the response categories, which you do here (the no and total columns contain numbers rather than category levels). If there were only one observation per row, you would have a column called something like status with values mobile and immobile, in the same way that our first lecture example contained lived and died for each rat when there was only one per row.
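
Extra: if you are curious what one-observation-per-row data would look like here (a sketch; I am not asking you to make this), you could expand the counts so that each daphnid gets its own row with a status column. The names mobile and status are just my choices:

# one row per daphnid: turn the two counts into a status column, then repeat
# each row as many times as its count says
daphnids %>% 
  mutate(mobile = total - no) %>% 
  pivot_longer(c(no, mobile), names_to = "status", values_to = "count") %>% 
  mutate(status = ifelse(status == "no", "immobile", "mobile")) %>% 
  uncount(count) %>% 
  select(dose, time, status)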

  8. (3 points) Fit a model predicting the probability that a daphnid is immobile given its dose and time. Display the results.

This means using a two-column response, made with cbind. The easiest way to do this is to put the cbind right inside the glm. The two columns have to be the number of “successes” first (the number of immobile daphnids at that dose and time), and then the number of “failures”, the number of mobile ones, which is a calculation: the total minus the number of immobile ones. If you get these the other way around, you are predicting the probability of a daphnid being mobile, which is the opposite of what we want.

daphnids.1 <- glm(cbind(no, total - no) ~ dose + time, data = daphnids, 
                  family = "binomial")
summary(daphnids.1)

Call:
glm(formula = cbind(no, total - no) ~ dose + time, family = "binomial", 
    data = daphnids)

Coefficients:
              Estimate Std. Error z value Pr(>|z|)    
(Intercept) -2.987e+00  3.418e-01  -8.739  < 2e-16 ***
dose         5.650e-04  6.864e-05   8.231  < 2e-16 ***
time48h      1.625e+00  3.273e-01   4.966 6.85e-07 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 181.017  on 15  degrees of freedom
Residual deviance:  37.521  on 13  degrees of freedom
AIC: 79.597

Number of Fisher Scoring iterations: 5

This way, it turns out, works nicely with plot_predictions, which we will be doing shortly. If you watched the video instead of coming to lecture, you probably saw another way to do it, which is to create the two-column response outside of the dataframe, like this:

daphnids %>% mutate(non = total - no) %>%
  select(no, non) %>% as.matrix() -> response
daphnids.1a <- glm(response ~ dose + time, data = daphnids, 
                  family = "binomial")
summary(daphnids.1a)

Call:
glm(formula = response ~ dose + time, family = "binomial", data = daphnids)

Coefficients:
              Estimate Std. Error z value Pr(>|z|)    
(Intercept) -2.987e+00  3.418e-01  -8.739  < 2e-16 ***
dose         5.650e-04  6.864e-05   8.231  < 2e-16 ***
time48h      1.625e+00  3.273e-01   4.966 6.85e-07 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 181.017  on 15  degrees of freedom
Residual deviance:  37.521  on 13  degrees of freedom
AIC: 79.597

Number of Fisher Scoring iterations: 5

You can also create the response in two steps, first working out a column with the number of mobile daphnids in it (using mutate and total - no), and then gluing those two columns together into a matrix.
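
A sketch of that two-step approach (the names mobile, daphnids_counts, and response2 are just my choices):

# step 1: a dataframe with the number of mobile daphnids as its own column
daphnids %>% mutate(mobile = total - no) -> daphnids_counts
# step 2: glue the two count columns together into a matrix to use as the response
response2 <- with(daphnids_counts, cbind(no, mobile))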

If you can get to this point via a method like this, you get the points here, but you will have a lot of trouble drawing the plot later.

Here is another way to create the response:

response <- with(daphnids, cbind(no, total - no))
response
      no   
 [1,]  0 20
 [2,]  1 19
 [3,]  3 17
 [4,]  3 17
 [5,]  5 15
 [6,]  6 14
 [7,]  7 13
 [8,] 17  3
 [9,]  0 20
[10,]  0 20
[11,]  6 14
[12,]  8 12
[13,] 11  9
[14,] 16  4
[15,] 18  2
[16,] 20  0

The response, being a matrix rather than a dataframe, displays all of itself rather than just the first ten rows.

Two points for fitting the model, with the correct kind of response, and one for displaying the results.

Extra 1:

The model we fitted assumes that the effect of dose is the same for both times (or, the effect of time is the same for all doses). You might remember from your study of ANOVA that the way you allow the effect of one explanatory variable to depend on the level of another is to add an interaction term, which you can do as below.1

daphnids.2 <- glm(cbind(no, total - no) ~ dose * time, data = daphnids, 
                  family = "binomial")
summary(daphnids.2)

Call:
glm(formula = cbind(no, total - no) ~ dose * time, family = "binomial", 
    data = daphnids)

Coefficients:
               Estimate Std. Error z value Pr(>|z|)    
(Intercept)  -2.445e+00  3.352e-01  -7.293 3.03e-13 ***
dose          4.134e-04  7.109e-05   5.816 6.04e-09 ***
time48h       3.192e-01  4.887e-01   0.653  0.51364    
dose:time48h  6.086e-04  1.932e-04   3.151  0.00163 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 181.017  on 15  degrees of freedom
Residual deviance:  24.542  on 12  degrees of freedom
AIC: 68.619

Number of Fisher Scoring iterations: 5
drop1(daphnids.2, test = "Chisq")

As it turns out, the interaction term is significant, best seen from the drop1 output, so the effect of dose is not the same for both times. We will see in a graph (in a later Extra) what the difference is.

  9. (2 points) From your output in question 8, interpret the number in the time row and the Estimate column, as far as you are able.

On my output, this is 1.625e+00, which is just 1.625.2

From here, you have a choice of two ways to go: either you can note that this number is positive and interpret that,3 or you can interpret the number itself, but then you have to get it right.

Hence, one of these:

  • time is a categorical variable with baseline 24h. The Estimate for time48h is positive, so a daphnid that is exposed to the toxic substance for 48 hours has a higher probability of being immobile than if it has only been exposed for 24 hours, holding dose constant.

  • time is a categorical variable with baseline 24h. The Estimate for time48h is 1.625, so a daphnid that is exposed to the toxic substance for 48 hours has a log-odds of being immobile that is 1.625 higher than if it has only been exposed for 24 hours, holding dose constant.

My intuition about log-odds suggests that this is a pretty big change (it’s actually the difference between probabilities of 0.5 and 0.84), but I don’t know how good yours is, so I am perfectly happy with the first interpretation, which gets at the point required. When you draw your graph later, you can check back and see whether your graph tells you the same story.
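
Where the 0.5 and 0.84 come from, if you want to check: 0 and 1.625 are log-odds, and plogis turns them into probabilities:

# inverse logit of log-odds 0 and 1.625
plogis(c(0, 1.625))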

  10. (2 points) Plot the predicted probabilities that a daphnid is immobile against dose and time, with confidence intervals for each probability.

This is plot_predictions from marginaleffects. Put the two explanatory variables inside a c in condition (in quotes), and put the quantitative one dose first to get the best graph:

plot_predictions(daphnids.1, condition = c("dose", "time"))

If you created a response variable outside the dataframe, this will happen to you:

plot_predictions(daphnids.1a, condition = c("dose", "time"))
Error in `model_data[, rn, drop = FALSE]`:
! Can't subset columns that don't exist.
✖ Column `response` doesn't exist.

The column response doesn’t exist in the dataframe, which is why plot_predictions couldn’t find it.

There is a way around this, but it is a lot of work. This involves getting some predictions and then plotting them. First, you have to pick some doses and the two times, and make a dataframe out of the combinations of them:

new <- datagrid(model = daphnids.1a, dose = seq(0, 10000, 1000), time = c("24h", "48h"))
new

Then get hold of the predictions for these combinations, saving the predictions and confidence limits (since you need them for the plot):

cbind(predictions(model = daphnids.1a, newdata = new)) %>% 
  select(dose, time, estimate, conf.low, conf.high)

and finally plot them. You probably will have to look up (or read the worksheet 2 solutions carefully for) how to plot the confidence intervals using geom_ribbon:

cbind(predictions(model = daphnids.1a, newdata = new)) %>% 
  select(dose, time, estimate, conf.low, conf.high) %>% 
  ggplot(aes(x = dose, fill = time, y = estimate, ymin = conf.low, ymax = conf.high)) +
    geom_line() + geom_ribbon(alpha = 0.3)

That was a lot of work for two points.

Extra: the same code makes the corresponding plot for the model with interaction, the model I called daphnids.2:

plot_predictions(daphnids.2, condition = c("dose", "time"))

This plot is different, because there actually was an interaction (that I didn’t ask you to fit). I discuss the interpretations of these graphs below. One thing that belongs here, though: you might remember from your studies of ANOVA, or of regression with categorical variables (as in C32), that when there is no interaction, plots of predictions against the explanatory variables make parallel lines (a so-called “interaction plot”). The predictions on the first of my two plots here don’t make parallel lines, but if we had done them on the log-odds scale they would have done. The intuition here from my (first) plot is that the probabilities go up in “the same kind of way” in that the two trends more or less track each other. On my second plot, that is clearly not the case, and then you would rightly suspect that there is a significant interaction.

  11. (2 points) What does your plot from question 10 tell you about how the probability of a daphnid being immobile depends on dose and time?

I repeat my plot (because I have another one, but you don’t need to):

plot_predictions(daphnids.1, condition = c("dose", "time"))

Take the two explanatory variables individually (and imagine the other one being fixed):

  • as dose increases, the probability of a daphnid being immobile also increases (at either time point)
  • The probability of a daphnid being immobile is higher when it is left in the toxic substance for 48 hours, as compared to 24 hours, and this is true for any dose. (This is what you (hopefully) concluded earlier; if not, you are invited to think about why.)

That’s enough for the 2 points (since I asked you only about how the probability changes). You might also consider the lengths of the confidence intervals. With 20 daphnids at each dose/time combination, there is a decent amount of information about the relationship between these variables and being mobile or immobile. Also, if you look back at the data, the relationship is pretty clear on both counts: the number of immobile daphnids increases from 0 to 20 (or nearly) as the dose increases, and if you compare 24 and 48 hours at the same dose, there are more immobile daphnids after 48 hours. Hence, we would expect to see the short confidence intervals (and strong significance) that we do.

Of course, if you were not able to draw a plot, you were not able to answer this question.

Extra: the graph for the model with interaction was different:

plot_predictions(daphnids.2, condition = c("dose", "time"))

The interaction was significant, and that shows up as the relationship between dose and the probability of being immobile being different for the two times. A small or medium dose has a much greater probability of making the daphnid immobile when the daphnid has been in the toxic substance for 48 hours, compared to 24 hours. The effect of time was much smaller at larger doses (almost all of the daphnids were immobile at the end anyway when the dose was larger).

Footnotes

  1. We see this again in this course in the section “ANOVA revisited”.↩︎

  2. R chose to use scientific notation for all the numbers in this column, or at least it did for me.↩︎

  3. That’s the reason for the wording at the end of the question.↩︎