STAD29 Assignment 4

You are expected to complete this assignment on your own: that is, you may discuss general ideas with others, but the writeup of the work must be entirely your own. If your assignment is unreasonably similar to that of another student, you can expect to be asked to explain yourself.

If you run into problems on this assignment, it is up to you to figure out what to do. The only exception is if it is impossible for you to complete this assignment, for example a data file cannot be read. (There is a difference between you not knowing how to do something, which you have to figure out, and something being impossible, which you are allowed to contact me about.)

You must hand in a rendered document that shows your code, the output that the code produces, and your answers to the questions. This should be a file with .html on the end of its name. There is no credit for handing in your unrendered document (ending in .qmd), because the grader cannot then see whether the code in it runs properly. After you have handed in your file, you should be able to see (in Attempts) what file you handed in, and you should make a habit of checking that you did indeed hand in what you intended to, and that it displays as you expect.

Hint: render your document frequently, and solve any problems as they come up, rather than trying to do so at the end (when you may be close to the due date). If your document will not successfully render, it is because of an error in your code that you will have to find and fix. The error message will tell you where the problem is, but it is up to you to sort out what the problem is.

Packages

library(tidyverse)
library(marginaleffects)
library(nnet)

Palmer penguins

The penguins dataset in the package palmerpenguins contains body measurements and other information for 344 adult penguins that were observed at the Palmer Archipelago in Antarctica. Variables of interest to us are:

  • species: the species of the penguin (one of Adelie, Chinstrap, Gentoo)
  • bill_length_mm: bill length in mm
  • bill_depth_mm: bill depth (from top to bottom) in mm
  • flipper_length_mm: flipper length in mm
  • body_mass_g: body mass in grams
  • sex: whether the penguin is male or female

In particular, is it possible to predict the species of a penguin from the other variables?

  1. (1 point) Load the package and display the dataset.

That means this:

library(palmerpenguins)
penguins
  1. (2 points) There are some missing values in the dataset. Remove all the observations that have missing values on any of the variables, and save the resulting dataset.

This is what drop_na does. You can either feed it some columns to check for missings in, or if you feed it no columns at all, it will check for missings everywhere. I saved it back into the original dataset, but you can certainly create a new one if you prefer:

penguins %>% drop_na() -> penguins

If you have another way that is consistent with the way we have learned things, that is also good. (Ideas from base R are not consistent with our approach.)

Extra: to convince myself that all the missing values are indeed gone:

summary(penguins)
      species          island    bill_length_mm  bill_depth_mm  
 Adelie   :146   Biscoe   :163   Min.   :32.10   Min.   :13.10  
 Chinstrap: 68   Dream    :123   1st Qu.:39.50   1st Qu.:15.60  
 Gentoo   :119   Torgersen: 47   Median :44.50   Median :17.30  
                                 Mean   :43.99   Mean   :17.16  
                                 3rd Qu.:48.60   3rd Qu.:18.70  
                                 Max.   :59.60   Max.   :21.50  
 flipper_length_mm  body_mass_g       sex           year     
 Min.   :172       Min.   :2700   female:165   Min.   :2007  
 1st Qu.:190       1st Qu.:3550   male  :168   1st Qu.:2007  
 Median :197       Median :4050                Median :2008  
 Mean   :201       Mean   :4207                Mean   :2008  
 3rd Qu.:213       3rd Qu.:4775                3rd Qu.:2009  
 Max.   :231       Max.   :6300                Max.   :2009  

and none of the columns do indeed have any missing values remaining in them. There appear to be 333 penguins left,1 so we had to get rid of 11 of them.

  1. (2 points) Why is polr from MASS not appropriate for fitting a model to predict species from the other variables?

polr assumes that the categorical response variable species has ordered categories, but in this case the species are just names or labels, and there is no (unambiguous) way to put them in order (such as from “low” to “high”).

Hence, looking just ahead, we will need to use multinom from the nnet package to fit our models.

  1. (2 points) Fit a suitable model predicting species from the other variables. There is no need to display any of the output from the model at this point (the fitting process will produce some output, which is fine to include).

From the previous question, we have ruled out polr, so:

penguins.1 <- multinom(species ~ bill_length_mm + bill_depth_mm + flipper_length_mm +
                         body_mass_g + sex, data = penguins)
# weights:  21 (12 variable)
initial  value 365.837892 
iter  10 value 16.719027
iter  20 value 3.047829
iter  30 value 0.103484
iter  40 value 0.063985
iter  50 value 0.006201
iter  60 value 0.003924
iter  70 value 0.001733
iter  80 value 0.001547
iter  90 value 0.001513
iter 100 value 0.001498
final  value 0.001498 
stopped after 100 iterations

Extra: should you be curious:

summary(penguins.1)
Call:
multinom(formula = species ~ bill_length_mm + bill_depth_mm + 
    flipper_length_mm + body_mass_g + sex, data = penguins)

Coefficients:
          (Intercept) bill_length_mm bill_depth_mm flipper_length_mm
Chinstrap -103.422229       16.51642     -24.41758       -0.31003929
Gentoo      -6.756321       12.21924     -30.09757       -0.08758996
           body_mass_g    sexmale
Chinstrap -0.030200928 -15.820607
Gentoo     0.005019684  -8.132101

Std. Errors:
          (Intercept) bill_length_mm bill_depth_mm flipper_length_mm
Chinstrap   0.5715154      27.749774      27.43640          4.097365
Gentoo      0.5715734       8.968787      33.25608         11.905692
          body_mass_g  sexmale
Chinstrap   0.1350169 3.077521
Gentoo      0.5848602 3.498288

Residual Deviance: 0.002996934 
AIC: 24.003 

This doesn’t really tell us very much. The coefficients are kind of slopes (treating the first category Adelie as baseline), and there are some log-odds hiding in the background, but for knowing things of actual interest to us, like “can we remove anything from the model?”, this is not what to look at.

  1. (3 points) Use step to see whether anything can be removed from your model. Which variables remain after step has finished? (Hints: save the output from step, because this is actually the best model that step found. Also, the fitting process will probably produce a lot of output, which, for this question, you can include in your assignment.)

First off, you are probably thinking drop1, but remember that this doesn’t work for these models (no, I don’t know why not either):

drop1(penguins.1, test = "Chisq")
trying - bill_length_mm 
Error in if (trace) {: argument is not interpretable as logical

so step it is:

penguins.2 <- step(penguins.1)
Start:  AIC=24
species ~ bill_length_mm + bill_depth_mm + flipper_length_mm + 
    body_mass_g + sex

trying - bill_length_mm 
# weights:  18 (10 variable)
initial  value 365.837892 
iter  10 value 129.185715
iter  20 value 114.018556
iter  30 value 113.336460
iter  40 value 113.216108
final  value 113.215886 
converged
trying - bill_depth_mm 
# weights:  18 (10 variable)
initial  value 365.837892 
iter  10 value 66.463118
iter  20 value 12.514139
iter  30 value 7.743531
iter  40 value 5.662963
iter  50 value 4.350499
iter  60 value 4.145847
iter  70 value 4.093510
iter  80 value 3.984724
iter  90 value 3.903126
iter 100 value 3.831010
final  value 3.831010 
stopped after 100 iterations
trying - flipper_length_mm 
# weights:  18 (10 variable)
initial  value 365.837892 
iter  10 value 25.561758
iter  20 value 1.350538
iter  30 value 0.065803
iter  40 value 0.020499
iter  50 value 0.016911
iter  60 value 0.012712
iter  70 value 0.005348
iter  80 value 0.001290
iter  90 value 0.000225
iter 100 value 0.000216
final  value 0.000216 
stopped after 100 iterations
trying - body_mass_g 
# weights:  18 (10 variable)
initial  value 365.837892 
iter  10 value 9.579038
iter  20 value 4.114389
iter  30 value 2.340467
iter  40 value 0.614487
iter  50 value 0.055642
iter  60 value 0.012303
iter  70 value 0.007327
iter  80 value 0.007163
iter  90 value 0.007015
iter 100 value 0.006995
final  value 0.006995 
stopped after 100 iterations
trying - sex 
# weights:  18 (10 variable)
initial  value 365.837892 
iter  10 value 16.321214
iter  20 value 3.754897
iter  30 value 1.631859
iter  40 value 0.012427
iter  50 value 0.001125
iter  60 value 0.001108
iter  70 value 0.001006
iter  80 value 0.000906
iter  90 value 0.000498
iter 100 value 0.000498
final  value 0.000498 
stopped after 100 iterations
                    Df       AIC
- flipper_length_mm 10  20.00043
- sex               10  20.00100
- body_mass_g       10  20.01399
<none>              12  24.00300
- bill_depth_mm     10  27.66202
- bill_length_mm    10 246.43177
# weights:  18 (10 variable)
initial  value 365.837892 
iter  10 value 25.561758
iter  20 value 1.350538
iter  30 value 0.065803
iter  40 value 0.020499
iter  50 value 0.016911
iter  60 value 0.012712
iter  70 value 0.005348
iter  80 value 0.001290
iter  90 value 0.000225
iter 100 value 0.000216
final  value 0.000216 
stopped after 100 iterations

Step:  AIC=20
species ~ bill_length_mm + bill_depth_mm + body_mass_g + sex

trying - bill_length_mm 
# weights:  15 (8 variable)
initial  value 365.837892 
iter  10 value 136.772749
iter  20 value 133.782638
iter  30 value 133.545483
final  value 133.538635 
converged
trying - bill_depth_mm 
# weights:  15 (8 variable)
initial  value 365.837892 
iter  10 value 30.179445
iter  20 value 5.699489
iter  30 value 5.175455
iter  40 value 5.153196
iter  50 value 5.070255
iter  60 value 4.997789
iter  70 value 4.893410
iter  80 value 4.826209
iter  90 value 4.708542
iter 100 value 4.695168
final  value 4.695168 
stopped after 100 iterations
trying - body_mass_g 
# weights:  15 (8 variable)
initial  value 365.837892 
iter  10 value 11.953428
iter  20 value 3.975600
iter  30 value 2.172365
iter  40 value 2.111553
iter  50 value 1.881400
iter  60 value 1.236767
iter  70 value 1.133364
iter  80 value 1.059253
iter  90 value 0.588576
iter 100 value 0.490499
final  value 0.490499 
stopped after 100 iterations
trying - sex 
# weights:  15 (8 variable)
initial  value 365.837892 
iter  10 value 20.098944
iter  20 value 3.621850
iter  30 value 1.160548
iter  40 value 1.108738
iter  50 value 1.063387
iter  60 value 0.958674
iter  70 value 0.926099
iter  80 value 0.837719
iter  90 value 0.797644
iter 100 value 0.741759
final  value 0.741759 
stopped after 100 iterations
                 Df       AIC
- body_mass_g     8  16.98100
- sex             8  17.48352
<none>           10  20.00043
- bill_depth_mm   8  25.39034
- bill_length_mm  8 283.07727
# weights:  15 (8 variable)
initial  value 365.837892 
iter  10 value 11.953428
iter  20 value 3.975600
iter  30 value 2.172365
iter  40 value 2.111553
iter  50 value 1.881400
iter  60 value 1.236767
iter  70 value 1.133364
iter  80 value 1.059253
iter  90 value 0.588576
iter 100 value 0.490499
final  value 0.490499 
stopped after 100 iterations

Step:  AIC=16.98
species ~ bill_length_mm + bill_depth_mm + sex

trying - bill_length_mm 
# weights:  12 (6 variable)
initial  value 365.837892 
iter  10 value 150.109767
iter  20 value 142.419807
iter  30 value 142.047323
final  value 142.041293 
converged
trying - bill_depth_mm 
# weights:  12 (6 variable)
initial  value 365.837892 
iter  10 value 139.953898
iter  20 value 133.697479
iter  30 value 133.399000
iter  40 value 133.326376
final  value 133.325373 
converged
trying - sex 
# weights:  12 (6 variable)
initial  value 365.837892 
iter  10 value 26.650783
iter  20 value 23.943597
iter  30 value 23.916873
iter  40 value 23.901339
iter  50 value 23.895442
iter  60 value 23.894251
final  value 23.892065 
converged
                 Df       AIC
<none>            8  16.98100
- sex             6  59.78413
- bill_depth_mm   6 278.65075
- bill_length_mm  6 296.08259

This is where you get rather a lot of output. step has to check the effect of removing each explanatory variable from each model it fits, and multinom itself produces half a dozen or so lines of output each time it is run.

To see which variables are left, either read them off from the output here, or look at summary of your new model just enough to see which variables it has in it:

summary(penguins.2)
Call:
multinom(formula = species ~ bill_length_mm + bill_depth_mm + 
    sex, data = penguins)

Coefficients:
          (Intercept) bill_length_mm bill_depth_mm     sexmale
Chinstrap   -242.9740       14.88938     -21.91411 -36.9051125
Gentoo       121.0476       14.80645     -44.69711  -0.1301506

Std. Errors:
          (Intercept) bill_length_mm bill_depth_mm  sexmale
Chinstrap    3.985819       9.369182      23.16685 23.33386
Gentoo       4.071031       9.413971      23.17697 43.25185

Residual Deviance: 0.9809986 
AIC: 16.981 

Whichever way you slice it, bill length, bill depth, and sex remain; flipper length and body mass have been removed.

If you are unable to get step to work in this context: show the error message you get, with an #| error: true at the top of your code chunk (so that your document will render despite the fact that there is an “error” in it), and then explicitly fit the model with sex, bill length and bill depth in it (as per the instructions on the course website):

penguins.2a <- multinom(species ~ bill_length_mm + bill_depth_mm + sex, data = penguins)
# weights:  15 (8 variable)
initial  value 365.837892 
iter  10 value 11.953428
iter  20 value 3.975600
iter  30 value 2.172365
iter  40 value 2.111553
iter  50 value 1.881400
iter  60 value 1.236767
iter  70 value 1.133364
iter  80 value 1.059253
iter  90 value 0.588576
iter 100 value 0.490499
final  value 0.490499 
stopped after 100 iterations

For the predictions below, use this model (so I have rather given away what the answer from step should be.)

  1. (3 points) Obtain predicted probabilities of each species from your best model for all nine combinations of: bill length 39.6, 40.0, and 40.4, bill depth 15.8, 16.0, and 16.2, sex female (only). Display those predicted probabilities in a way that allows for easy comparison of the three species probabilities for given values of the other variables.

Three stages, as we have seen before (a point each):

  • make a dataframe (perhaps called new) of all the combination of explanatory variables
  • do the predictions
  • rearrange the predictions so that they are easy to compare.

For the first step: datagrid does all possible combinations, so that is the way to go. When there are a lot of things that you want combinations of, it can be clearer to define them first. The best model is the one output by step, which I called penguins.2:

lengths <- c(39.6, 40.0, 40.4)
depths <- c(15.8, 16.0, 16.2)
sexes <- "female"
new <- datagrid(model = penguins.2, bill_length_mm = lengths, bill_depth_mm = depths, sex = sexes)
new

There are \(3 \times 3 \times 1 = 3^2 = 9\) combinations of values. My habit, when I define the values first like this, is to use a plural name for the values, so that when I construct new, it’s “singular = plural” all the way through.

If you prefer to put the actual values inside the datagrid, you can do that, but then you will need to split the line of code so that it is easier to see all the values, for example a new line for each variable, like this:

new0 <- datagrid(model = penguins.2, 
                 bill_length_mm = c(39.6, 40.0, 40.4),
                 bill_depth_mm = c(15.8, 16.0, 16.2),
                 sex = "female")
new0

Doing it this way, with each variable on a new line, makes it easier to read.

Second, the predictions. Save the explanatory variable names, the column group that says which species the predicted probability is for, and the column estimate of the predicted probabilities themselves:

cbind(predictions(model = penguins.2, newdata = new)) %>%
  select(group, estimate, bill_length_mm:sex)

Third, pivot-wider the estimated probabilities into columns named for the species. Mine came out wider than my display, so I additionally rounded off the probabilities to make the columns narrower:

cbind(predictions(model = penguins.2, newdata = new)) %>%
  select(group, estimate, bill_length_mm:sex) %>% 
  pivot_wider(names_from = group, values_from = estimate) %>% 
  mutate(across(Adelie:Gentoo, \(x) round(x, 3))) 

A second way to do this is to round off the predicted probabilities before you pivot-wider, which is easier to do but harder to see that you can do it this way:

cbind(predictions(model = penguins.2, newdata = new)) %>%
  select(group, estimate, bill_length_mm:sex) %>% 
  mutate(estimate = round(estimate, 3)) %>% 
  pivot_wider(names_from = group, values_from = estimate) 

Extra: I had to play around a bit with the numbers I wanted you to get predictions for. The problem is that the species are actually very distinct (as you will see with your graph), and so if you are not careful with your values, you will get a lot of predicted probabilities that are close to zero and one. I ended up restricting things to just females for this question because (as it turns out) males have a similar pattern but are just bigger overall, and the pattern was easier to see with just one sex.

  1. (3 points) Using your predictions, what combinations of values for bill length and bill depth distinguish the three species?

Find the largest predicted probabilities for each species and say something about whether they go with small or large values on each variable:

  • Adelie have small bill length and large bill depth
  • Chinstrap have large bill length and medium or large bill depth
  • Gentoo have small bill depth (but can have any bill length).

Another way you can go is to look at the combinations of large and small values on each explanatory variable:

  • large bill length and large bill depth is probably Chinstrap
  • large bill length and small bill depth is probably Gentoo
  • small bill length and large bill depth is (almost certainly) Adelie
  • small bill length and small bill depth is probably Gentoo (there actually aren’t really any of these, as we’ll see from a graph shortly)

Three points for something that identifies each of the three species.

Expect the grader to check your predictions (that they are the same as mine, 2 points) and that you have said something sensible based on them (one point).

  1. (3 points) Make a plot that includes bill length, bill depth, species, and sex. Does your plot support your conclusions from the previous question?

This plot is one you would make from the data, not of the predictions from the model (which is rather hard to interpret).

This has two quantitative variables (the first two) and two categorical (the last two). If you only had one categorical variable, this would be a scatterplot with the levels of the categorical variable as colour. The procedure we have seen for dealing with “too many” categorical variables is to use facets. My take is that sex is explanatory, so that is better as facets, and hence:

ggplot(penguins, aes(x = bill_length_mm, y = bill_depth_mm, colour = species)) +
  geom_point() + facet_wrap(~ sex)

I am deliberately not using scales = "free" because I want to be able to compare where the points are on the two facets. Two points for the graph.

I have to say that I think these graphs are actually the best way towards understanding here. The bill length and bill depth do a great job of distinguishing the penguin species:

  • small bill length (and large bill depth) is Adelie
  • large bill length and large bill depth is Chinstrap
  • large bill length and small bill depth is Gentoo
  • this pattern is the same for males and females, but the males are bigger on both measurements than the females. (This last is the reason for having the same scales in both facets, and was also the reason for having you look at just females previously.)

For you, say something sensible about how the graph (for females) is consistent or not with what you found in the previous question. One point for that.

When I looked at the predicted probabilities, I got the same thing as this except that there, I predicted Gentoo for small bill depth and all bill lengths. (What actually happened is that there really weren’t any small bill lengths and small bill depths in the data, but the prediction had to come up with something, and the values we were predicting for were slightly closer to Gentoo than Adelie.)

Extra: I actually drew the graph first, and then tried to reverse-engineer a sensible prediction question for you.

plot_predictions doesn’t work so easily here because there are two explanatory variables in the model. To show both, because you want to keep colour for the species, you put the other ones on the end to make facets. This chooses five representative values for a quantitative explanatory variable and gives you a facet for each. I also added sex; plot_predictions will use facet_grid rather than facet_wrap when there are two explanatory variables after the thing that is colour. The facets for the first one are in columns and for the second one are in rows. The five different values for bill_depth_mm are hidden in the facet labels, and you can’t really see them, but take it that the plot is using something like a five-number summary:

plot_predictions(penguins.2, condition = c("bill_length_mm", 
                                           "group", "sex", 
                                           "bill_depth_mm"))

The species are actually very distinguishable (as you saw from the plot you made for this part), so most of the predicted probabilities are close to zero or one.2 Reading these graphs down the left side (for females):

  • when bill depth is small, it is a Gentoo
  • when bill depth is small-to-moderate:
    • when bill length is small, it is an Adelie
    • when bill length is large, it is a Gentoo
  • when bill depth is large:
    • when bill length is small, it is an Adelie
    • when bill length is large, it is a Chinstrap.

The actual definitions of “small” and “large” for bill depth depend on the value of bill length.

This is more or less the same as our graph (the one I had you make), and the pattern is almost exactly the same for males and females, except that for females, the third facet down is the same as the last two, and for males, it is the same as the second one.3

Ebola

Ebola is an infectious disease. It is spread via blood from an infected person or animal. There is now a vaccine, but in 2014 when our data were collected, there was no vaccine and people with Ebola had to be treated in hospital. The data in http://ritsokiguess.site/datafiles/ebola.csv are cases from an Ebola outbreak in Freetown, Sierra Leone. Each row of the data file is one person (“case”), and the following was recorded:

  • case_id: an ID for each case
  • date_hospitalisation: the date the person was admitted to hospital with Ebola
  • date_outcome: the date when it was known what happened to that person, either they were recovered or they died.
  • outcome: what the outcome was on that date (recovered or died).
  1. (1 point) Read in and display (a little of) the data.

There were a lot of cases, so you will only see the first few:

my_url <- "http://ritsokiguess.site/datafiles/ebola.csv"
ebola <- read_csv(my_url)
Rows: 4952 Columns: 4
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (4): case_id, date_hospitalisation, date_outcome, outcome

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
ebola

Extra: these data came from a bigger dataset (that was itself extracted from a bigger dataset that needed to be tidied up), which I got from here.

ebola0 <- read_rds("~/Downloads/linelist_cleaned.rds")
glimpse(ebola0)
Rows: 5,888
Columns: 30
$ case_id              <chr> "5fe599", "8689b7", "11f8ea", "b8812a", "893f25",…
$ generation           <dbl> 4, 4, 2, 3, 3, 3, 4, 4, 4, 4, 4, 4, 6, 5, 6, 9, 1…
$ date_infection       <date> 2014-05-08, NA, NA, 2014-05-04, 2014-05-18, 2014…
$ date_onset           <date> 2014-05-13, 2014-05-13, 2014-05-16, 2014-05-18, …
$ date_hospitalisation <date> 2014-05-15, 2014-05-14, 2014-05-18, 2014-05-20, …
$ date_outcome         <date> NA, 2014-05-18, 2014-05-30, NA, 2014-05-29, 2014…
$ outcome              <chr> NA, "Recover", "Recover", NA, "Recover", "Recover…
$ gender               <chr> "m", "f", "m", "f", "m", "f", "f", "f", "m", "f",…
$ age                  <dbl> 2, 3, 56, 18, 3, 16, 16, 0, 61, 27, 12, 42, 19, 7…
$ age_unit             <chr> "years", "years", "years", "years", "years", "yea…
$ age_years            <dbl> 2, 3, 56, 18, 3, 16, 16, 0, 61, 27, 12, 42, 19, 7…
$ age_cat              <fct> 0-4, 0-4, 50-69, 15-19, 0-4, 15-19, 15-19, 0-4, 5…
$ age_cat5             <fct> 0-4, 0-4, 55-59, 15-19, 0-4, 15-19, 15-19, 0-4, 6…
$ hospital             <chr> "Other", "Missing", "St. Mark's Maternity Hospita…
$ lon                  <dbl> -13.21574, -13.21523, -13.21291, -13.23637, -13.2…
$ lat                  <dbl> 8.468973, 8.451719, 8.464817, 8.475476, 8.460824,…
$ infector             <chr> "f547d6", NA, NA, "f90f5f", "11f8ea", "aec8ec", "…
$ source               <chr> "other", NA, NA, "other", "other", "other", "othe…
$ wt_kg                <dbl> 27, 25, 91, 41, 36, 56, 47, 0, 86, 69, 67, 84, 68…
$ ht_cm                <dbl> 48, 59, 238, 135, 71, 116, 87, 11, 226, 174, 112,…
$ ct_blood             <dbl> 22, 22, 21, 23, 23, 21, 21, 22, 22, 22, 22, 22, 2…
$ fever                <chr> "no", NA, NA, "no", "no", "no", NA, "no", "no", "…
$ chills               <chr> "no", NA, NA, "no", "no", "no", NA, "no", "no", "…
$ cough                <chr> "yes", NA, NA, "no", "yes", "yes", NA, "yes", "ye…
$ aches                <chr> "no", NA, NA, "no", "no", "no", NA, "no", "no", "…
$ vomit                <chr> "yes", NA, NA, "no", "yes", "yes", NA, "yes", "ye…
$ temp                 <dbl> 36.8, 36.9, 36.9, 36.8, 36.9, 37.6, 37.3, 37.0, 3…
$ time_admission       <chr> NA, "09:36", "16:48", "11:22", "12:60", "14:13", …
$ bmi                  <dbl> 117.18750, 71.81844, 16.06525, 22.49657, 71.41440…
$ days_onset_hosp      <dbl> 2, 1, 2, 2, 1, 1, 2, 1, 1, 2, 2, 2, 1, 0, 2, 0, 1…

There were originally more cases than I saved for you, and a lot more variables. The original dates actually were dates, so I had to tinker with those to give you something to do. There were also a lot of missing outcome dates, and since the point of this problem was to have you work with dates, I figured I would get rid of those.

ebola0 %>% select(case_id, date_hospitalisation, date_outcome, outcome) %>% 
  drop_na(date_outcome) %>% 
  mutate(across(starts_with("date"), \(x) format(x, format = "%b %e, %Y"))) 

format is a multi-coloured function that reformats one thing as another thing (for example, displaying numbers with a certain number of decimals). When applied to an actual date, it reformats it as text according to the instructions in format: in this case, the (abbreviated) month name, the day of the month, and the (four-digit) year. (So, you see I am taking actual dates and making them not-dates so that you have to do the reverse.)

This is the dataframe I saved for you. Actually, not quite: sometimes the date of hospitalization and the outcome date were recorded the wrong way around, and the first time I tried this some of the hospital stays were a negative number of days (!), so I had to go back and make sure that the date of hospitalization was in fact earlier than the outcome date. That is what I ended up saving for you.

  1. (2 points) Convert the two date columns in your dataframe to actual dates, saving your dataframe and showing the results.

The best way to do this is in one step using across. The two columns you want to make real dates out of both have names that begin with date (and are the only ones that do). The text dates are month, day, year in that order, so mdy is what you need to do the conversion:

ebola %>% mutate(across(starts_with("date_"), \(date) mdy(date))) -> ebola2
ebola2

This overwrites the two date columns, which is fine because we don’t need the original ones for anything. You can see that the columns now really are dates and they display in ISO format.

If you don’t see this, do the two columns one at a time, which will be a bit more repetitive:

ebola %>% 
  mutate(date_hospitalisation = mdy(date_hospitalisation),
         date_outcome = mdy(date_outcome))

but comes to the same place. Full credit for either way.

  1. (4 points) Make a new column in your dataframe that is the number of days that each person was in hospital, and make a suitable plot of that number of days according to whether the person recovered or died. How do the numbers of days compare?

Three things to do.

First, work out the number of days each person was in the hospital, as a number. The best way is to make an interval using %--% (which will be a number of seconds internally), and then divide that by the number of seconds in a day:

ebola2 %>% 
  mutate(in_hosp = (date_hospitalisation %--% date_outcome) / days(1)) 

Two points for that. The second-best way to do it is to simply take the difference between the two dates:

ebola2 %>% 
  mutate(in_hosp_diff = date_outcome - date_hospitalisation)

but now you see that what I called in_hosp_diff is actually a time (what R calls a difftime) rather than a number, and when R displays a difftime, it picks a suitable unit for you. So this doesn’t quite answer the question. One point.

Second, a graph. This is a quantitative variable (the number of days) and a categorical one (the outcome), and so a boxplot is called for:

ebola2 %>% 
  mutate(in_hosp = (date_hospitalisation %--% date_outcome) / days(1)) %>% 
  ggplot(aes(x = outcome, y = in_hosp)) + geom_boxplot()

If you used the difftimes, your graph will come with a warning message:

ebola2 %>% 
  mutate(in_hosp_diff = date_outcome - date_hospitalisation) %>% 
  ggplot(aes(x = outcome, y = in_hosp_diff)) + geom_boxplot()
Don't know how to automatically pick scale for object of type <difftime>.
Defaulting to continuous.

which tells you that ggplot doesn’t actually know how to plot difftimes, so it is treating them as numbers. This is another reason why it’s better to go the first way (getting the numbers of days as numbers).

One point for the graph, since this is really C32 work.

Third, comment on the graph, for the last point. The people who recovered were on average in hospital longer than the ones who died, but there are a lot of outliers in both groups. That is, the people who died tended to die fairly quickly. (The best answer also comments on the NA group; these are people whose outcome date was recorded, but whose outcome itself was not, so we don’t know what happened to them.)

  1. (3 points) Epidemiologists (people who study epidemics) are interested in how the number of cases varies over time. Usually, they count the number of cases per week during the course of the epidemic. The function floor_date computes the date at the beginning of a time interval. It takes two inputs: a date, and an input called unit that says what you want the date to be the beginning of. Work out a column that is the date of the beginning of the week in which each hospitalization date falls (this is by default Monday), and use your new column to count the number of cases in each week.

Following the clues, one step at a time (for me: you can do it all at once):

ebola2 %>% 
  mutate(week = floor_date(date_hospitalisation, unit = "week")) 

You see that week is the beginning of the week in which date_hospitalisation falls, and it is indeed a date. Then:

ebola2 %>% 
  mutate(week = floor_date(date_hospitalisation, unit = "week")) %>% 
  count(week) -> ebola_week
ebola_week
  1. (2 points) Plot the number of cases per week against the date at the start of the week, joining the points by lines.

A gimme to finish with:

ggplot(ebola_week, aes(x = week, y = n)) + geom_point() + geom_line()

You can see the evolution of the epidemic: a few cases at first, then a rapid increase, then a gradual decline.

Extra: I’m not an epidemiologist, so my first thought was to plot the number of cases every day:

ebola2 %>% count(date_hospitalisation) %>% 
  ggplot(aes(x = date_hospitalisation, y = n)) + geom_point() + geom_line()

but the trend is much more noisy this way. Plotting weekly shows an up and down trend in the number of cases, in rather the same way that using a good number of bins on a histogram shows the shape of the distribution clearly (as opposed to getting a noisy picture if you use too many bins).

Footnotes

  1. For example, add up the males and females, or add up the numbers of each species.↩︎

  2. This is why I had to work so hard to have you find some predictions that were not close to 0 or 1.↩︎

  3. Look carefully to see which of those two close-together traces, because of the scale, is green and which one is blue.↩︎