STAD29 Assignment 2

You are expected to complete this assignment on your own: that is, you may discuss general ideas with others, but the writeup of the work must be entirely your own. If your assignment is unreasonably similar to that of another student, you can expect to be asked to explain yourself.

If you run into problems on this assignment, it is up to you to figure out what to do. The only exception is if it is impossible for you to complete this assignment, for example a data file cannot be read. (There is a difference between you not knowing how to do something, which you have to figure out, and something being impossible, which you are allowed to contact me about.)

You must hand in a rendered document that shows your code, the output that the code produces, and your answers to the questions. This should be a file with .html on the end of its name. There is no credit for handing in your unrendered document (ending in .qmd), because the grader cannot then see whether the code in it runs properly. After you have handed in your file, you should be able to see (in Attempts) what file you handed in, and you should make a habit of checking that you did indeed hand in what you intended to, and that it displays as you expect.

Hint: render your document frequently, and solve any problems as they come up, rather than trying to do so at the end (when you may be close to the due date). If your document will not successfully render, it is because of an error in your code that you will have to find and fix. The error message will tell you where the problem is, but it is up to you to sort out what the problem is.

1 Low birth weight

Low birth weight, defined as a birth weight of less than 2500 grams, is an outcome of concern because infant mortality rates and birth defect rates are very high for low birth weight babies. The mother’s behaviour during pregnancy is believed to have a great effect on whether the baby is of normal or low birth weight.

The variables of interest to us are:

  • low: underweight (low birth weight, under 2500 g) or normalweight (normal birth weight, 2500 g or more).
  • lwt: the mother’s weight at her last menstrual period (in pounds).
  • smoke: whether or not the mother smoked during the pregnancy (Yes or No).

The data, with these variables and a number of others, are in http://ritsokiguess.site/datafiles/lowbwt.csv.

(a) (1 point) Read in and display some of the data.

As usual:

library(tidyverse)
library(marginaleffects)
my_url <- "http://ritsokiguess.site/datafiles/lowbwt.csv"
birth_weights <- read_csv(my_url)
Rows: 189 Columns: 11
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (7): low, race, smoke, ptl, ht, ui, ftv
dbl (4): id, age, lwt, bwt

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
birth_weights

Extra: if you wanted to, you could focus on the variables of interest to us, and check that they have sensible values:

birth_weights %>% select(low, lwt, smoke)

and they do indeed look sensible (the weights look reasonable for women, in pounds, at the very start of the pregnancy).
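Extra: a slightly more thorough check, sketched here on the assumption that the tidyverse is installed and the URL above can be reached, is to count the categories of the two categorical variables and to summarize the mother's weights. The data frame name birth_weights matches the one above; the summary column names (min_lwt and so on) are illustrative.

```r
library(tidyverse)

# Read the data from the URL given in the question
my_url <- "http://ritsokiguess.site/datafiles/lowbwt.csv"
birth_weights <- read_csv(my_url, show_col_types = FALSE)

# How many babies fall in each combination of the two categorical variables
birth_weights %>% count(low, smoke)

# Smallest, middle, and largest mother's weight, in pounds
birth_weights %>%
  summarize(min_lwt = min(lwt), median_lwt = median(lwt), max_lwt = max(lwt))
```

If low or smoke had misspelled or unexpected categories, they would show up as extra rows in the count.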

(b) (3 points) Fit a logistic regression predicting whether or not the baby is of low birth weight, as it depends on the mother’s weight at last menstrual period and whether or not the mother smoked. Display the results. Hint: the response variable is not zero and one, so it needs to be a factor in the model.

These variables are respectively low, lwt, and smoke, with the first needing to be factor(low):

low.1 <- glm(factor(low) ~ lwt + smoke, data = birth_weights, family = "binomial")
summary(low.1)

Call:
glm(formula = factor(low) ~ lwt + smoke, family = "binomial", 
    data = birth_weights)

Coefficients:
            Estimate Std. Error z value Pr(>|z|)  
(Intercept)  0.62200    0.79592   0.781   0.4345  
lwt         -0.01332    0.00609  -2.188   0.0287 *
smokeYes     0.67667    0.32470   2.084   0.0372 *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 234.67  on 188  degrees of freedom
Residual deviance: 224.34  on 186  degrees of freedom
AIC: 230.34

Number of Fisher Scoring iterations: 4

I chose these variables because they were of scientific interest, and as it turns out (see next part), they are worth keeping in the model.
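Extra: to get a feel for what these estimates say, a fitted probability can be worked out by hand from the coefficients in the output above. This is only a sketch: the coefficients are copied (rounded) from the summary table, and the weight of 130 pounds is an assumed, illustrative value rather than one taken from the data.

```r
# Coefficients copied (rounded) from the summary output above
b_intercept <- 0.62200
b_lwt <- -0.01332
b_smoke <- 0.67667

# Assumed illustrative mother's weight, in pounds
lwt <- 130

# Linear predictor (log-odds of underweight) for a non-smoker and a smoker
log_odds_no <- b_intercept + b_lwt * lwt
log_odds_yes <- log_odds_no + b_smoke

# plogis() is the inverse logit: it turns log-odds into probabilities
p_no <- plogis(log_odds_no)
p_yes <- plogis(log_odds_yes)
c(non_smoker = p_no, smoker = p_yes)

# Odds ratio for smoking, holding the mother's weight fixed
odds_ratio <- exp(b_smoke)
odds_ratio
```

The positive coefficient for smokeYes means that the smoker's predicted probability comes out higher at any given weight, and the negative coefficient for lwt means that a heavier mother has a lower predicted probability of a low birth weight baby.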

(c) (2 points) How do you know that your model is predicting the probability of a low birth weight baby, as opposed to a normal birth weight baby? Explain briefly.

The two categories of low are normalweight and underweight. The first of these (alphabetically), normalweight, is the baseline, and we predict the probability of the second one, underweight.
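One way to check this, sketched with a small made-up vector rather than the real data, is to look at the levels of the factor: glm with a binomial family treats the first level as the baseline and models the probability of the other one. The names low_values and low_factor are illustrative.

```r
# A made-up vector containing the two categories of low (not the real data)
low_values <- c("underweight", "normalweight", "normalweight", "underweight")

# factor() puts the levels in alphabetical order by default
low_factor <- factor(low_values)

# The first level listed is the baseline; the second is the one being predicted
levels(low_factor)
```

The same check on the real data would be levels(factor(birth_weights$low)), assuming the data frame from part (a).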

Extra: I had to play with the data to make this work for you; I eventually decided that calling the low birth weight category “underweight” would make it the second one alphabetically. Having a variable called low not come out as modelling the probability of a low birth weight would be too confusing!

(d) (2 points) Should either of the explanatory variables be removed from the logistic regression? Explain briefly.

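The very brief answer is no, because they are both significant: the P-values for lwt and smoke (0.029 and 0.037 respectively) are both less than 0.05, and therefore removing either of them would make the model fit significantly worse.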

The fact that I didn’t ask you to fit a better model is a rather large hint here!
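Extra: the P-values in the summary output are Wald tests. A likelihood-ratio version of the same idea, sketched below on the assumption that the tidyverse is installed and the data URL can be reached, uses drop1 to test removing each explanatory variable in turn. The name dropped is illustrative, and the P-values from this approach need not match the ones in summary exactly.

```r
library(tidyverse)

# Read the data and refit the model from part (b)
my_url <- "http://ritsokiguess.site/datafiles/lowbwt.csv"
birth_weights <- read_csv(my_url, show_col_types = FALSE)
low.1 <- glm(factor(low) ~ lwt + smoke, data = birth_weights, family = "binomial")

# Test each explanatory variable by removing it from the model, one at a time
dropped <- drop1(low.1, test = "Chisq")
dropped
```

A small P-value in a row says that removing that variable would make the fit significantly worse, so that variable should stay.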

(e) (3 points) Make a plot showing the fitted probability of a baby being of low birth weight as it depends on the mother’s weight at last menstrual period and whether or not the mother smokes. Hint: this is one line of code, using something from the marginaleffects package. Put the quantitative explanatory variable first.

Feed plot_predictions your fitted model, and an input called condition that contains the names of the two explanatory variables (in quotes), lwt first:

plot_predictions(low.1, condition = c("lwt", "smoke"))