STAD29 Assignment 2

You are expected to complete this assignment on your own: that is, you may discuss general ideas with others, but the writeup of the work must be entirely your own. If your assignment is unreasonably similar to that of another student, you can expect to be asked to explain yourself.

If you run into problems on this assignment, it is up to you to figure out what to do. The only exception is if it is impossible for you to complete this assignment, for example a data file cannot be read. (There is a difference between you not knowing how to do something, which you have to figure out, and something being impossible, which you are allowed to contact me about.)

You must hand in a rendered document that shows your code, the output that the code produces, and your answers to the questions. This should be a file with .html on the end of its name. There is no credit for handing in your unrendered document (ending in .qmd), because the grader cannot then see whether the code in it runs properly. After you have handed in your file, you should be able to see (in Attempts) what file you handed in, and you should make a habit of checking that you did indeed hand in what you intended to, and that it displays as you expect.

Hint: render your document frequently, and solve any problems as they come up, rather than trying to do so at the end (when you may be close to the due date). If your document will not successfully render, it is because of an error in your code that you will have to find and fix. The error message will tell you where the problem is, but it is up to you to sort out what the problem is.

1 Low birth weight

Low birth weight, defined as a baby that weighs less than 2500 grams when it is born, is an outcome that is of concern because infant mortality rates and birth defect rates are very high for low birth weight babies. The mother’s behaviour during pregnancy is believed to have a great effect on whether the baby is of normal or low birth weight.

The variables of interest to us are:

low: underweight (low birth weight, under 2500 g) or normalweight (normal birth weight, 2500 g or larger).
lwt: the mother’s weight at her last menstrual period (in pounds)
smoke: whether or not the mother smoked during the pregnancy (Yes or No).

The data, with these variables and a number of others, are in http://ritsokiguess.site/datafiles/lowbwt.csv.

(a) (1 point) Read in and display some of the data.

As usual:

my_url <- "http://ritsokiguess.site/datafiles/lowbwt.csv"
birth_weights <- read_csv(my_url)

Rows: 189 Columns: 11
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (7): low, race, smoke, ptl, ht, ui, ftv
dbl (4): id, age, lwt, bwt

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

birth_weights

Extra: if you wanted to, you could focus on the variables of interest to us, and check that they have sensible values:

birth_weights %>% select(low, lwt, smoke)

and they do indeed look sensible (the weights look reasonable for women, in pounds, at the very start of the pregnancy).

(b) (3 points) Fit a logistic regression predicting whether or not the baby is of low birth weight, as it depends on the mother’s weight at last menstrual period and whether or not the mother smoked. Display the results. Hint: the response variable is not zero and one, so it needs to be a `factor` in the model.

These variables are respectively low, lwt, and smoke, with the first needing to be factor(low):

low.1 <- glm(factor(low) ~ lwt + smoke, data = birth_weights, family = "binomial")
summary(low.1)


Call:
glm(formula = factor(low) ~ lwt + smoke, family = "binomial", 
    data = birth_weights)

Coefficients:
            Estimate Std. Error z value Pr(>|z|)  
(Intercept)  0.62200    0.79592   0.781   0.4345  
lwt         -0.01332    0.00609  -2.188   0.0287 *
smokeYes     0.67667    0.32470   2.084   0.0372 *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 234.67  on 188  degrees of freedom
Residual deviance: 224.34  on 186  degrees of freedom
AIC: 230.34

Number of Fisher Scoring iterations: 4

I chose these variables because they were of scientific interest, and as it turns out (see next part), they are worth keeping in the model.

(c) (2 points) How do you know that your model predicting the probability of a low birth weight baby, as opposed to a normal birth weight baby? Explain briefly.

The two categories of low are normalweight and underweight. The first of these (alphabetically), normalweight, is the baseline, and we predict the probability of the second one, underweight.

Extra: I had to play with the data to make this work for you; I eventually decided that calling the low birth weight category “underweight” would make it the second one alphabetically. Having a variable called low not come out as modelling the probability of a low birth weight would be too confusing!

(d) (2 points) Should either of the explanatory variables be removed from the logistic regression? Explain briefly.

The very brief answer is no, because they are both significant (and therefore removing either of them will make the model fit worse).

The fact that I didn’t ask you to fit a better model is a rather large hint here!

(e) (3 points) Make a plot showing the fitted probability of a baby being of low birth weight as it depends on the mother’s weight at last menstrual period and whether or not the mother smokes. Hint: this is one line of code, using something from the `marginaleffects` package. Put the quantitative explanatory variable first.

Feed plot_predictions your fitted model, and a thing called condition that contains the two explanatory variables (in quotes), lwt first:

plot_predictions(low.1, condition = c("lwt", "smoke"))

For this, there is no need to set up a dataframe called new with some values to predict for; plot_predictions decides for itself what values of lwt to use, and uses enough of them to give you some nice smooth curves.

Extra 1: there are red and blue “envelopes” around the predictions (that give confidence intervals for the probability of being low birth weight as it depends on lwt and smoke. The grey area is where the red and blue envelopes overlap; even though this area is fairly substantial, there actually is a significant difference between smoking and not.)

Extra 2: if you put smoke first, you get a plot that is much harder to interpret:

plot_predictions(low.1, condition = c("smoke", "lwt"))

It seems to have turned lwt into a categorical variable as well, and chosen five representative values of it.

(f) (2 points) In your plot, describe the effects of the two explanatory variables on the probability of a low birth weight baby.

The effect of smoking is as you would expect: for any mother’s weight (before pregnancy), the probability of a low birth weight baby is higher if the mother smokes during pregnancy (blue curve), compared to not doing so (red curve).
The effect of mother’s weight (before pregnancy) may not be what you were expecting: the probability of a low birth weight baby is lower if the mother is heavier.

I guess the mother being heavier before pregnancy might be associated with other problems during pregnancy, but the baby being of low birth weight is not one of them.

2 Grain beetles

A number of grain beetles were exposed to ethylene oxide at one of ten different concentrations (in mg/l). For each dose, the number of beetles affected, and the number exposed to that dose, were recorded. The data are in http://ritsokiguess.site/datafiles/beetle.csv. Our aim is to see whether a beetle being affected or not depends on the concentration of ethylene oxide.

(a) (1 point) Read in and display the data. (You should see all of the data values this time.)

As usual:

my_url <- "http://ritsokiguess.site/datafiles/beetle.csv"
beetle <- read_csv(my_url)

Rows: 10 Columns: 3
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
dbl (3): conc, affected, exposed

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

beetle

The assumption hiding in the background is that each beetle was exposed to only one of the concentrations, and that beetle is either affected or not. If a beetle had participated several times at different concentrations, we would have some kind of repeated measures analysis, which would be a good deal messier than the analysis we are about to do.

(b) (2 points) In this dataframe, does each row refer to one beetle or more than one beetle? Explain briefly.

Each row refers to more than one beetle, typically about 30 beetles (in the column exposed). A second clue is that we have a number affected in the dataframe, rather than an indication that a particular beetle was affected or not affected.

(c) (4 points) Fit a suitable logistic regression for predicting the probability that a beetle is affected, as it depends on the concentration of ethylene oxide, and display the output from the logistic regression.

Since each row refers to more than one beetle, we first need to obtain a response matrix (to use as the response in our logistic regression). This means that we need a column with the number affected (at each concentration) and another column with the number not affected, which we will have to compute.

I think the easiest way to do this is to keep everything in the tidyverse until right at the end, when we turn it into a matrix:

beetle %>% mutate(not_affected = exposed - affected) %>% 
  select(affected, not_affected) %>% 
  as.matrix() -> response
response

      affected not_affected
 [1,]       23            7
 [2,]       30            0
 [3,]       29            2
 [4,]       22            8
 [5,]       23            3
 [6,]        7           20
 [7,]       12           19
 [8,]       17           13
 [9,]       10           21
[10,]        0           24

Make sure that the first column is the number affected (so that, in the logistic regression, you will be estimating the probability of being affected).

Two points for that much.

Then we use this on the left-hand side of the squiggle in the logistic regression:

beetle.1 <- glm(response ~ conc, data = beetle, family = "binomial")
summary(beetle.1)


Call:
glm(formula = response ~ conc, family = "binomial", data = beetle)

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept) -6.01047    0.77937  -7.712 1.24e-14 ***
conc         0.34127    0.04153   8.217  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 138.001  on 9  degrees of freedom
Residual deviance:  37.697  on 8  degrees of freedom
AIC: 69.27

Number of Fisher Scoring iterations: 5

The other two points for this.

(d) (2 points) Is there a significant effect of concentration? If there is, does a larger concentration go with a larger or smaller probability of being affected? Explain briefly.

The P-value is extremely small (less than \(2.2 \times 10^{-16}\)), so there is definitely a significant effect of concentration on the probability of being affected. The Estimate (0.34) is positive, so a larger concentration goes with a larger probability of being affected.

That’s all you need to say. Some additional thoughts:

To confirm the direction of the effect, look back at the data. Most of the affected beetles are at the top, which goes with the highest concentrations (the concentrations are in descending order). So a higher concentration should go with a higher chance of being affected.

Be careful not to over-interpret this, or to claim that the Estimate being 0.34 has a direct effect on the probability. It doesn’t: increasing the concentration by 1 increases the log-odds of being affected by 0.34. See below for more discussion of this.

If you think about it, the 0.34 cannot have a direct effect on the probability, because pretty soon the predicted probability will be greater than 1, and the whole point of a logistic regression is to predict a probability that makes sense as a probability.

(e) (3 points) For concentrations 15, 20, and 25, use your model to predict the probability of being affected. Clearly display your predictions along with the concentrations they are predictions for. Are the predictions consistent with what you said in the previous part? Explain briefly.

This uses the marginaleffects package. We are predicting for some new concentrations, so we need to make a dataframe of those first (by my tradition called new):

new <- tibble(conc = c(15, 20, 25))

(one point for this), and then we make the predictions. There are rather a lot of columns in the output, so select the ones you want your reader to see:¹

cbind(predictions(model = beetle.1, newdata = new)) %>% 
  select(conc, estimate)

The predicted probabilities of being affected increase as the concentration increases, which is what we said in the previous part (it is just a different way to see the same thing).

Extra: The probabilities are not going up linearly. The jump in predicted probability between 15 and 20 is about 0.4, but the jump between 20 and 25 is less than 0.3. We can turn those estimates into log-odds and see what happens then:

cbind(predictions(model = beetle.1, newdata = new)) %>% 
  select(conc, estimate) %>% 
  mutate(odds = estimate / (1 - estimate)) %>% 
  mutate(log_odds = log(odds))

With each increase in 5 of concentration, the log-odds increases by about 1.7, which is (to within our rounding) the same as

5 * 0.34127

[1] 1.70635

That’s the actual meaning of “slope” in this context.

(f) (3 points) Make a plot showing the predicted probability of being affected for doses covering the range of the data. On your plot, show confidence intervals for the probability for each dose. (You will have to build this plot yourself because your response variable was not in the original dataframe.)

Your first thought is probably plot_predictions, which does exactly this, but which doesn’t work here:

plot_predictions(model = beetle.1, condition = "conc")

Error in `model_data[, rn, drop = FALSE]`:
! Can't subset columns that don't exist.
✖ Column `response` doesn't exist.

The column response exists all right, but only in our workspace, not in the dataframe that beetle.1 was fitted for. That is the reason for the error message. It is perhaps odd that predictions works but plot_predictions does not. I don’t think either of them are actually using the information in response. But so it goes.

So, let’s build this ourselves. The first thing is to do a bigger set of predictions. Our data go from concentrations of about 10 to about 25, so let’s do a bunch of predictions this time, say from 10 to 25 in steps of 1 (you can do whatever you like, but if you have a denser grid of points, it may take longer to compute. On the other hand, if you have a less dense grid of points, the graph won’t look as nice):

new <- tibble(conc = seq(10, 25, 1))
cbind(predictions(model = beetle.1, newdata = new))

Then plot the Estimates joined by lines, and use geom_ribbon to display the confidence intervals about the predictions:

cbnew <- tibble(conc = seq(10, 25, 1))
cbind(predictions(model = beetle.1, newdata = new)) %>% 
  ggplot(aes(x = conc, y = estimate, ymin = conf.low, ymax = conf.high)) +
  geom_line() + geom_ribbon(alpha = 0.3)

This took me a few goes to get right, because there are several things to keep in mind:

geom_ribbon requires a ymin for the lower end of the “ribbon” and a ymax for the upper end (not y_min and y_max as I first thought)
the confidence limits in the output from predictions are called conf.low and conf.high with dots in them
the plot looks nicer without plotting the predictions themselves (I ended up with no geom_point).
the first time I plotted the ribbon, it was solid black, overwriting everything else. At that point, I remembered that you have to make the ribbon see-through by using an alpha of less than 1 in geom_ribbon. The exact value is not important; the key is that you can see the curve marking the predictions, and that the ribbon is dark enough that you can distinguish it from the background of the graph.

All of this is to say that you should never be afraid of making errors, because the error messages that come back will often tell you how to fix up what you did. (This was especially true for the message that told me to use ymin and ymax. The Tidyverse people work hard on making informative error messages, but that all depends on your reading the message that comes back after an error!)

(g) (2 points) On your graph, would you say that the probabilities are accurately estimated, or not? Explain briefly.

I would say that they are accurately estimated, because the ribbon (confidence intervals around the predictions) is narrow. Some of the other corresponding plots we’ve seen have the probabilities estimated much worse than this.

Have an opinion and defend it. If you really want to call this ribbon “wide”, do it if you can make it make sense.

If you didn’t get as far as making the graph, you can still answer this part by going back to your predictions and looking at some more columns:

new <- tibble(conc = c(15, 20, 25))
cbind(predictions(model = beetle.1, newdata = new)) %>% 
  select(conc, estimate, conf.low, conf.high)

Then make a call about whether you think these predicted probabilities have short confidence intervals.

Extra: My take, that the ribbon is narrow, is also supported by the sample size. How many beetles were there altogether?

beetle %>% summarize(total_beetles = sum(exposed))

Even though each individual beetle doesn’t contribute very much (it was either affected or not affected), there are a lot of beetles here, and so we should have a good sense of how being affected depends on concentration.

Footnotes

Make sure at least that these two columns are visible to the grader without scrolling across. Making your reader do unnecessary work is not going to impress them!↩︎