Worksheet 2
Packages
library(tidyverse)
library(marginaleffects)
Low birth weight
Low birth weight, defined as a baby that weighs less than 2500 grams when it is born, is an outcome that is of concern because infant mortality rates and birth defect rates are very high for low birth weight babies. The mother’s behaviour during pregnancy is believed to have a great effect on whether the baby is of normal or low birth weight.
The variables of interest to us are:
- `low`: `underweight` (low birth weight, under 2500 g) or `normalweight` (normal birth weight, 2500 g or larger)
- `lwt`: the mother’s weight at her last menstrual period (in pounds)
- `smoke`: whether or not the mother smoked during the pregnancy (`Yes` or `No`)
The data, with these variables and a number of others, are in http://ritsokiguess.site/datafiles/lowbwt.csv.
- Read in and display some of the data.
As usual:
<- "http://ritsokiguess.site/datafiles/lowbwt.csv"
my_url <- read_csv(my_url) birth_weights
Rows: 189 Columns: 11
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (7): low, race, smoke, ptl, ht, ui, ftv
dbl (4): id, age, lwt, bwt
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
birth_weights
Extra: if you wanted to, you could focus on the variables of interest to us, and check that they have sensible values:
birth_weights %>% select(low, lwt, smoke)
and they do indeed look sensible (the weights look reasonable for women, in pounds, at the very start of the pregnancy).
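If you wanted to be more thorough, a quick numerical summary would confirm the categories and the range of weights. This is just a sketch of one way to do it (the summary column names are my own choice):
# how many mothers fall in each combination of birth weight category and smoking status
birth_weights %>% count(low, smoke)
# the smallest and largest mothers' weights, to check they are plausible in pounds
birth_weights %>% summarize(lowest = min(lwt), highest = max(lwt))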
- Fit a logistic regression predicting whether or not the baby is of low birth weight, as it depends on the mother’s weight at last menstrual period and whether or not the mother smoked. Display the results. Hint: the response variable is not zero and one, so it needs to be a `factor` in the model.
These variables are respectively `low`, `lwt`, and `smoke`, with the first needing to be `factor(low)`:
low.1 <- glm(factor(low) ~ lwt + smoke, data = birth_weights, family = "binomial")
summary(low.1)
Call:
glm(formula = factor(low) ~ lwt + smoke, family = "binomial",
data = birth_weights)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.62200 0.79592 0.781 0.4345
lwt -0.01332 0.00609 -2.188 0.0287 *
smokeYes 0.67667 0.32470 2.084 0.0372 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 234.67 on 188 degrees of freedom
Residual deviance: 224.34 on 186 degrees of freedom
AIC: 230.34
Number of Fisher Scoring iterations: 4
I chose these variables because they were of scientific interest, and as it turns out (see later), they are worth keeping in the model.
- How do you know that your model is predicting the probability of a low birth weight baby, as opposed to a normal birth weight baby? Explain briefly.
The two categories of `low` are `normalweight` and `underweight`. The first of these (alphabetically), `normalweight`, is the baseline, and we predict the probability of the second one, `underweight`.
Extra: I had to play with the data to make this work for you; I eventually decided that calling the low birth weight category “underweight” would make it the second one alphabetically. Having a variable called `low` not come out as modelling the probability of a low birth weight would be too confusing!
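If you want to see the ordering for yourself, you can ask R for the levels of the factor. A small sketch of how you might check (and, if you ever needed to, change) the baseline; the `fct_relevel` line is illustrative only:
# the first level listed is the baseline; the model predicts the probability of the second
levels(factor(birth_weights$low))
# to model the probability of normalweight instead, make underweight the baseline (first level), e.g. with forcats:
# birth_weights %>% mutate(low = fct_relevel(low, "underweight"))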
- Should either of the explanatory variables be removed from the logistic regression? Explain briefly.
The very brief answer is no, because they are both significant (and therefore removing either of them will make the model fit worse).
A perhaps better way to verify this is to note that `smoke` is categorical, and so we really ought to use `drop1`:
drop1(low.1, test = "Chisq")
Something odd appears to have happened here, which I ought to explain. When you do a regular regression, the output from `drop1` and `summary` are the same if your categorical variable has two categories (as `smoke` does here). But when you venture outside the world of regular regression, the tests in `summary` (known as Wald tests in this context) and `drop1` (likelihood ratio tests) are not the same, even when you think they should be. They are only “asymptotically equivalent”, meaning that if you had an infinitely large sample size they would be the same, but in the actual real world of finite sample sizes they will be different. The consolation is that if you have a “reasonably large” sample size, the P-values will not be very different, and in almost all cases you will end up drawing the same conclusion, which is what happened here. Our sample size of \(n = 189\) was on the large side, though for a logistic regression this is not a huge sample.
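If you are curious, you can line the two kinds of test up for `smoke`: pull the Wald P-value out of the coefficient table and set it next to the likelihood ratio one from `drop1`. A sketch, not something you needed to do:
# Wald test P-value for smoke, from the summary coefficient table
coef(summary(low.1))["smokeYes", "Pr(>|z|)"]
# likelihood ratio test P-value for smoke, from drop1
drop1(low.1, test = "Chisq")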
One more thing to mention: when you are using `drop1`, the `test` to use depends on whether you are fitting something like a regression that is done by least squares (use `test = "F"`), or whether you are fitting some other model that uses maximum likelihood, like this one (use `test = "Chisq"`).
Anyway, the fact that I didn’t ask you to fit a better model is a rather large hint here about what to expect!
- Make a plot showing the fitted probability of a baby being of low birth weight as it depends on the mother’s weight at last menstrual period and whether or not the mother smokes. Hint: this is one line of code, using something from the `marginaleffects` package. Put the quantitative explanatory variable first.
Feed `plot_predictions` your fitted model, and a thing called `condition` that contains the two explanatory variables (in quotes), `lwt` first:
plot_predictions(low.1, condition = c("lwt", "smoke"))
For this, there is no need to set up a dataframe called `new` with some values to predict for; `plot_predictions` decides for itself what values of `lwt` to use, and uses enough of them to give you some nice smooth curves.
Extra 1: the red and blue “envelopes” around the predictions give confidence intervals for the probability of being low birth weight as it depends on `lwt` and `smoke`. The grey area is where the red and blue envelopes overlap; even though this area is fairly substantial, there actually is a significant difference between smoking and not.
Extra 2: if you put `smoke` first, you get a plot that is much harder to interpret:
plot_predictions(low.1, condition = c("smoke", "lwt"))
It seems to have turned `lwt` into a categorical variable as well, and chosen five representative values of it.
- In your plot, describe the effects of the two explanatory variables on the probability of a low birth weight baby.
The effect of smoking is as you would expect: for any mother’s weight (before pregnancy), the probability of a low birth weight baby is higher if the mother smokes during pregnancy (blue curve), compared to not doing so (red curve).
The effect of mother’s weight (before pregnancy) may not be what you were expecting: the probability of a low birth weight baby is lower if the mother is heavier.
I guess the mother being heavier before pregnancy might be associated with other problems during pregnancy, but the baby being of low birth weight is not one of them.
Grain beetles
A number of grain beetles were exposed to ethylene oxide at one of ten different concentrations (in mg/l), in column `conc`. For each concentration, the number of beetles affected, and the total number exposed to that concentration, were recorded. The data are in http://ritsokiguess.site/datafiles/beetle.csv. Our aim is to see whether a beetle being affected or not depends on the concentration of ethylene oxide.
- Read in and display the data. (You should see all of the data values this time.)
As usual:
<- "http://ritsokiguess.site/datafiles/beetle.csv"
my_url <- read_csv(my_url) beetle
Rows: 10 Columns: 3
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
dbl (3): conc, affected, exposed
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
beetle
The assumption hiding in the background is that each beetle was exposed to only one of the concentrations, and that beetle is either affected or not. If a beetle had participated several times at different concentrations, we would have some kind of repeated measures analysis, which would be a good deal messier than the analysis we are about to do.
- In this dataframe, does each row refer to one beetle or more than one beetle? Explain briefly.
Each row refers to more than one beetle, typically about 30 beetles (in the column `exposed`). A second clue is that we have a number affected in the dataframe, rather than an indication that a particular beetle was affected or not affected.
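If you like, you can make this concrete, and get an early hint of the concentration effect, by working out the observed proportion affected in each row. A small sketch (the column name `proportion` is mine):
# observed proportion of beetles affected at each concentration
beetle %>% mutate(proportion = affected / exposed)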
- Fit a suitable logistic regression for predicting the probability that a beetle is affected, as it depends on the concentration of ethylene oxide, and display the output from the logistic regression.
Since each row refers to more than one beetle, we first need to obtain a response matrix (to use as the response in our logistic regression). This means that we need a column with the number affected (at each concentration) and another column with the number not affected, which we will have to compute.
I think the easiest way to do this is to compute the response matrix inside the `glm`. Make sure that the first column is the number affected (so that, in the logistic regression, you will be estimating the probability of being affected).
beetle.1 <- glm(cbind(affected, exposed - affected) ~ conc, data = beetle, family = "binomial")
summary(beetle.1)
Call:
glm(formula = cbind(affected, exposed - affected) ~ conc, family = "binomial",
data = beetle)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -6.01047 0.77937 -7.712 1.24e-14 ***
conc 0.34127 0.04153 8.217 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 138.001 on 9 degrees of freedom
Residual deviance: 37.697 on 8 degrees of freedom
AIC: 69.27
Number of Fisher Scoring iterations: 5
Alternatively, you can compute a response matrix outside the dataframe. To do that, you can keep everything in the tidyverse until right at the end, when we turn it into a `matrix`:
beetle %>% mutate(not_affected = exposed - affected) %>%
  select(affected, not_affected) %>%
  as.matrix() -> response
response
affected not_affected
[1,] 23 7
[2,] 30 0
[3,] 29 2
[4,] 22 8
[5,] 23 3
[6,] 7 20
[7,] 12 19
[8,] 17 13
[9,] 10 21
[10,] 0 24
Then we use this on the left-hand side of the squiggle in the logistic regression:
beetle.1a <- glm(response ~ conc, data = beetle, family = "binomial")
summary(beetle.1a)
Call:
glm(formula = response ~ conc, family = "binomial", data = beetle)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -6.01047 0.77937 -7.712 1.24e-14 ***
conc 0.34127 0.04153 8.217 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 138.001 on 9 degrees of freedom
Residual deviance: 37.697 on 8 degrees of freedom
AIC: 69.27
Number of Fisher Scoring iterations: 5
to get to the same place. The problem with doing it this way is that it makes it more difficult to plot the predictions; doing it the first way makes it straightforward.
- Is there a significant effect of concentration? If there is, does a larger concentration go with a larger or smaller probability of being affected? Explain briefly.
The P-value is extremely small (less than \(2.2 \times 10^{-16}\)), so there is definitely a significant effect of concentration on the probability of being affected. The Estimate (0.34) is positive, so a larger concentration goes with a larger probability of being affected.
That’s all you need to say. Some additional thoughts:
To confirm the direction of the effect, look back at the data. Most of the affected beetles go with the highest concentrations (the concentrations are in descending order). So a higher concentration should go with a higher chance of being affected.
Be careful not to over-interpret this, or to claim that the Estimate being 0.34 has a direct effect on the probability. It doesn’t: increasing the concentration by 1 increases the log-odds of being affected by 0.34. See below for more discussion of this.
If you think about it, the 0.34 cannot have a direct effect on the probability, because pretty soon the predicted probability will be greater than 1, and the whole point of a logistic regression is to predict a probability that makes sense as a probability.
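If you would like a slope on a scale that is a little easier to talk about, you can exponentiate it to get an odds multiplier. A sketch, not part of what was asked:
# each extra mg/l of ethylene oxide multiplies the odds of being affected by exp(0.341), about 1.41
exp(coef(beetle.1)["conc"])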
- For concentrations 15, 20, and 25, use your model to predict the probability of being affected. Clearly display your predictions along with the concentrations they are predictions for. Are the predictions consistent with what you said in the previous part? Explain briefly.
This uses the `marginaleffects` package. We are predicting for some concentrations not in the original dataframe, so we need to make a dataframe of those first (by my tradition called `new`):
new <- tibble(conc = c(15, 20, 25))
(one point for this), and then we make the predictions. There are rather a lot of columns in the output, so `select` the ones you want your reader to see:¹
cbind(predictions(model = beetle.1, newdata = new)) %>%
select(conc, estimate)
The predicted probabilities of being affected increase as the concentration increases, which is what we said in the previous part (it is just a different way to see the same thing).
Extra: The probabilities are not going up linearly. The jump in predicted probability between 15 and 20 is about 0.4, but the jump between 20 and 25 is less than 0.3. We can turn those estimates into log-odds and see what happens then:
cbind(predictions(model = beetle.1, newdata = new)) %>%
select(conc, estimate) %>%
mutate(odds = estimate / (1 - estimate)) %>%
mutate(log_odds = log(odds))
With each increase in 5 of concentration, the log-odds increases by about 1.7, which is (to within our rounding) the same as
5 * 0.34127
[1] 1.70635
That’s the actual meaning of “slope” in this context.
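To put that in symbols, the fitted model on the log-odds scale is \(\log\left(\hat{p} / (1 - \hat{p})\right) = -6.01 + 0.341 \times \text{conc}\) (using the Estimates from the summary above), so increasing the concentration by 5 adds \(5 \times 0.341 \approx 1.71\) to the log-odds, which is the constant jump we just saw.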
- Make a plot showing the predicted probability of being affected for concentrations covering the range of the data. On your plot, show confidence intervals for the probability for each concentration.
Your first thought is probably `plot_predictions`, which does exactly this, but only as long as you fitted the model with the `cbind` inside the `glm`:
plot_predictions(model = beetle.1, condition = "conc")
If you constructed the response matrix first and then try to do this, this is what happens:
plot_predictions(model = beetle.1a, condition = "conc")
Error in `model_data[, rn, drop = FALSE]`:
! Can't subset columns that don't exist.
✖ Column `response` doesn't exist.
The column `response` exists all right, but only in our workspace, not in the dataframe that `beetle.1a` was fitted from. That is the reason for the error message. It is perhaps odd that `predictions` works but `plot_predictions` does not. I don’t think either of them is actually using the information in `response`. But so it goes.
So, now we need to build this ourselves. The first thing is to do a bigger set of predictions. Our data go from concentrations of about 10 to about 25, so let’s do a bunch of predictions this time, say from 10 to 25 in steps of 1 (you can do whatever you like, but if you have a denser grid of points, it may take longer to compute. On the other hand, if you have a less dense grid of points, the graph won’t look as nice):
new <- tibble(conc = seq(10, 25, 1))
cbind(predictions(model = beetle.1, newdata = new))
Then plot the estimates joined by lines, and use `geom_ribbon` to display the confidence intervals about the predictions:
new <- tibble(conc = seq(10, 25, 1))
cbind(predictions(model = beetle.1, newdata = new)) %>%
  ggplot(aes(x = conc, y = estimate, ymin = conf.low, ymax = conf.high)) +
  geom_line() + geom_ribbon(alpha = 0.3)
This took me a few goes to get right, because there are several things to keep in mind:
- `geom_ribbon` requires a `ymin` for the lower end of the “ribbon” and a `ymax` for the upper end (not `y_min` and `y_max` as I first thought)
- the confidence limits in the output from `predictions` are called `conf.low` and `conf.high`, with dots in them
- the plot looks nicer without plotting the predictions themselves (I ended up with no `geom_point`)
- the first time I plotted the ribbon, it was solid black, overwriting everything else. At that point, I remembered that you have to make the ribbon see-through by using an `alpha` of less than 1 in `geom_ribbon`. The exact value is not important; the key is that you can see the curve marking the predictions, and that the ribbon is dark enough that you can distinguish it from the background of the graph.
All of this is to say that you should never be afraid of making errors, because the error messages that come back will often tell you how to fix up what you did. (This was especially true for the message that told me to use `ymin` and `ymax`. The Tidyverse people work hard on making informative error messages, but that all depends on your reading the message that comes back after an error!)
- On your graph, would you say that the probabilities are accurately estimated, or not? Explain briefly.
I would say that they are accurately estimated, because the ribbon (confidence intervals around the predictions) is narrow. Some of the other corresponding plots we’ve seen have the probabilities estimated much worse than this.
Have an opinion and defend it. If you really want to call this ribbon “wide”, you can, as long as you make a sensible case for it.
If you didn’t get as far as making the graph, you can still answer this part by going back to your predictions and looking at some more columns:
new <- tibble(conc = c(15, 20, 25))
cbind(predictions(model = beetle.1, newdata = new)) %>%
  select(conc, estimate, conf.low, conf.high)
Then make a call about whether you think these predicted probabilities have short confidence intervals.
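One way to put a number on “short” is to compute the width of each interval. A sketch (the column name `width` is mine), using the `new` dataframe from just above:
# width of the 95% confidence interval for each predicted probability
cbind(predictions(model = beetle.1, newdata = new)) %>%
  mutate(width = conf.high - conf.low) %>%
  select(conc, estimate, width)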
Extra: My take, that the ribbon is narrow, is also supported by the sample size. How many beetles were there altogether?
beetle %>% summarize(total_beetles = sum(exposed))
Even though each individual beetle doesn’t contribute very much (it was either affected or not affected), there are a lot of beetles here, and so we should have a good sense of how being affected depends on concentration.
Footnotes
¹ When this was graded, you needed to make sure at least that these two columns were visible to the grader without scrolling across. Making your reader do unnecessary work is not going to impress them!↩︎