Worksheet 2

Published

September 8, 2023

Questions are below. My solutions are below all the question parts for a question; scroll down if you get stuck. There is extra discussion below that for some of the questions; you might find that interesting to read, maybe after tutorial. There is probably too much here for you to finish in an hour, but I encourage you to continue later with what you were unable to finish in tutorial.

For these worksheets, you will learn the most by spending a few minutes thinking about how you would answer each question before you look at my solution. There are no grades attached to these worksheets, so feel free to guess: it makes no difference at all how wrong your initial guess is!

1 The Boat Race

Each year, the Oxford and Cambridge University rowing teams race against each other on the River Thames in London (England). For the 1992 race, the weights (in pounds) of the participants on each team were recorded, and can be found in https://ritsokiguess.site/datafiles/boat_race.txt.

  1. Take a look at the data file (in your web browser), and describe how the data values are separated one from the next.

  2. Read the file into a dataframe and display at least some of it.

  3. Make a suitable graph of your data.

  4. Would you say, based on your plot, that the average or typical weights of the rowers on the two teams are similar or different? Explain briefly.

  5. Each rowing team consists of eight rowers plus a cox, whose job is to keep the rowers in tempo. The cox does not row themselves. Which of the nine individuals in each team do you think is the cox? Explain briefly.

My solutions follow:

  1. Take a look at the data file, and describe how the data values are separated one from the next.

Solution

Click on the URL, and see that the data values are separated by a single space. First comes the name of the university the rower comes from, then a single space, then the rower’s weight in pounds.

\(\blacksquare\)

  1. Read the file into a dataframe and display at least some of it.

Solution

The data values are separated by a single space, so you need read_delim with a single space as the second input. My habit is to save the (often long) URL into a variable first:

my_url <- "https://ritsokiguess.site/datafiles/boat_race.txt"
rowers <- read_delim(my_url, " ")
Rows: 18 Columns: 2
── Column specification ────────────────────────────────────────────────────────
Delimiter: " "
chr (1): university
dbl (1): weight

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
rowers

You can check, by looking under the column heading, that the university names are text and the weights really are numbers (dbl means “double-precision decimal number”).

The alternative below works, but you have some extra work to do to explain why it works:

rowers0 <- read_delim(my_url)
Rows: 18 Columns: 2
── Column specification ────────────────────────────────────────────────────────
Delimiter: " "
chr (1): university
dbl (1): weight

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
rowers0

To get full credit for doing it this way (if it were on an assignment), you would need to draw the reader’s attention to this line:

Delimiter: " "

This means that read_delim guessed, by looking at the file before reading it in, that the data values seemed to be separated by single spaces. It had to guess, because you didn’t say what to look for.

This also works, but in this course is wrong:

rowers0 <- read.table(my_url, header = TRUE)
rowers0

If you do it this way, it reveals that you are not paying attention. In the course outline it says “I expect you to do things as they are done in this course”, and in the lecture it says that read_delim is how to read a file of this type. You may have learned things differently elsewhere, but I do not use read.table in this course at all (and indeed, not very many base R ideas).

If you know anything about rowing, you might be a bit suspicious about there being nine observations, since a rowing team usually has eight people. See the last part of the question.

\(\blacksquare\)

  1. Make a suitable graph of your data.

Solution

First take a look at the type of variables you have: one categorical (the name of the university) and one quantitative (the weight of the rower). An appropriate graph for variables of this type is a side-by-side boxplot.

In ggplot the categorical variable is x because it goes horizontally, and the quantitative one is y (vertically):

ggplot(rowers, aes(x = university, y = weight)) + geom_boxplot()

This is the best plot. About the only other useful plot at all is above and below histograms. To get those, make a histogram of weights, and then facet by university, displaying the results in one column:

ggplot(rowers, aes(x = weight)) + geom_histogram(bins = 10) +
   facet_wrap(~university, ncol = 1)

If you go this way, you will almost certainly have to experiment with the number of bins. One of the automatic bin choices is unlikely to help you here, since so many of the bins are empty. Also, the purpose of the plot is to compare the distributions, like the boxplot, so you really need the histograms to be above and below, with a common \(x\)-scale, not left and right with a common count scale.

\(\blacksquare\)

  1. Would you say, based on your plot, that the average or typical weights of the rowers on the two teams are similar or different? Explain briefly.

Solution

The horizontal lines across the boxes on a boxplot are the medians of the distributions of weights. These medians are, I would say, very similar, especially given the amount of variability. (Have an opinion and defend it.)

From a boxplot, you cannot say anything about means, because they do not appear on a boxplot. With the kinds of distributions you have here, the mean is not a very sensible summary anyway, because of the outliers.

The use of the word “typical” in the question is meant to guide you towards a measure of centre, which might be mean, median, or even mode. A boxplot only shows you the median, which, as discussed, is a sensible measure of centre here anyway, so discuss that. (If your plot was the over-and-under histograms, you can try to figure out where, say, the medians are, or even use the mode. I care mostly about your thought process, not so much about the precise answer you get.)

\(\blacksquare\)

  1. Each rowing team consists of eight rowers plus a cox, whose job is to keep the rowers in tempo. The cox does not row themselves. Which of the nine individuals in each team do you think is the cox? Explain briefly.

Solution

The obvious guess is “the low outlier”. But you also need to say something about why: if the cox does not row, this means that the other rowers are expending energy moving the cox as well as the boat. Thus it is an advantage to have a cox who is as light in weight as possible. Hence the low outlier is most probably the cox. (This was indeed the case.)

The cox on a rowing team has a similar role to the conductor in an orchestra, except that the musicians in an orchestra are not trying to move the conductor across the water as fast as possible!1

\(\blacksquare\)

2 Intensive Care Unit patients

The Intensive Care Unit (ICU) at a hospital is where incoming patients that need the most urgent treatment are admitted. When a patient is admitted, a large number of measurements are taken, to help the ICU doctor decide on an appropriate treatment. The variables of interest to us here are these two (there are actually many others, as you will see):

  • sta: vital status (0 = lived, 1 = died)
  • typ: type of admission (0 = elective, 1 = emergency)

The data for 200 patients were in http://www.medicine.mcgill.ca/epidemiology/Joseph/courses/EPIB-621/icudat.txt.

  1. In your web browser, take a look at the data, and describe how the data are laid out.

  2. Read in and display (some of) the data.

  3. Make a suitable graph of the two variables of interest. Make sure you consider what type of variable these are (which might not be the same as how they are recorded).

  4. What do you learn from your graph, in the context of the data?

My solutions:

  1. In your web browser, take a look at the data, and describe how the data are laid out.

Solution

The columns are lined up. This is the most important thing, but you can also note that the values are separated by more than one space, so that read_delim will not work.

\(\blacksquare\)

  1. Read in and display (some of) the data.

Use read_table for the aligned columns:

my_url <- "http://www.medicine.mcgill.ca/epidemiology/Joseph/courses/EPIB-621/icudat.txt"
icu <- read_table(my_url)

── Column specification ────────────────────────────────────────────────────────
cols(
  .default = col_double()
)
ℹ Use `spec()` for the full column specifications.
icu

With so many rows and so many columns, you can only expect to see some of the data for sure. There are 200 rows as advertised, and a lot of columns (including, you can check, the ones we actually want).

\(\blacksquare\)

  1. Make a suitable graph of the two variables of interest. Make sure you consider what type of variable these are (which might not be the same as how they are recorded). (Hint: you may need to use factor(typ) or factor(sta), or both, inside your aes().)

Solution

These are recorded as 0 and 1, so you might think they are quantitative, but they are not: they are actually both categorical. A patient either lives or dies, and their surgery is either emergency or elective.2 So we have two categorical variables, and the right plot is a grouped bar chart:

ggplot(icu, aes(x = factor(typ), fill = factor(sta))) + geom_bar(position = "dodge")

There are two things to sort out: which variable is x and which is fill. You can try it both ways around and see which you like better, perhaps thinking about how you are going to answer the next part. I like this one better, because sta is really a response variable (whether a patient lives or dies depends on all the things that happened to them), while typ is an explanatory variable (what kind of procedure they were in the ICU for). If you have an obvious response variable, it usually makes more sense to put the other variable on the \(x\)-axis, because you can then say “out of all the patients who had an emergency procedure, how many of them died”, which seems like the logical way to say it.

The reason for the position = "dodge" is to put the bars within a group side by side. I think that makes them easier to compare than the default, which is to stack them on top of each other.

The reason for the factor things is that a grouped bar chart is for categorical variables, and as far as R is concerned, these variables are actually quantitative. We know that the 0 and 1 have no meaning as numbers, but R has no way to know that. What the factor says is “treat this variables as categorical, even though it may not look like it”.3

\(\blacksquare\)

  1. What do you learn from your graph, in the context of the data?

Solution

First, disentangle what those variables are: sta is whether each patient lived or died, 1 meaning “died”, and typ is what type of procedure each patient came into the ICU for, with 1 being emergency.

On my graph, for the patients that were in the ICU for an elective procedure (on the left), very few of them died, but for the patients in for an emergency procedure, a larger fraction of them died, though the majority still survived. (There were more patients altogether having an emergency procedure, but even so, the fraction of them that died is still larger.)

If you drew your graph the other way around, like this:

ggplot(icu, aes(fill = factor(typ), x = factor(sta))) + geom_bar(position = "dodge")

your conclusion winds up being this: out of the patients that lived, most of them were in the ICU for an emergency procedure, and out of the patients that died, almost all of them were in for an emergency procedure. I think you’ll agree that this is a much less clear conclusion.

\(\blacksquare\)

3 Hummingbirds and flowers

The tropical flower Heliconia is fertilized by hummingbirds, a different species for each variety of Heliconia. Over time, the lengths of the flowers and the form of the hummingbirds’ beaks have evolved to match each other. The length of the Heliconia flower is therefore an important measurement. Does it differ among varieties?

The data set at http://ritsokiguess.site/datafiles/heliconia_long.csv contains the lengths (in millimetres) of samples of flowers from each of three varieties of Heliconia: bihai, caribaea red, and caribaea yellow.

  1. Read in and display (some of) the dataset.

  2. Why would a boxplot be a suitable graph for these data? Explain briefly.

  3. Draw a boxplot of these data.

  4. What do you learn from your boxplot? Explain briefly.

  5. An alternative graph in this situation is a set of three histograms next to each other. Draw this graph. Consider your aims in drawing the graph; you will need to think about what “next to each other” most usefully means for you. Hint: problem 7.9(c) in PASIAS.

My solutions:

  1. Read in and display (some of) the dataset.

Solution

This is a .csv, so there is no particular difficulty:

my_url <- "http://ritsokiguess.site/datafiles/heliconia_long.csv"
heliconia <- read_csv(my_url)
Rows: 54 Columns: 2
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (1): variety
dbl (1): length

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
heliconia

54 observations of flower variety and length, with the varieties as described in the question.

\(\blacksquare\)

  1. Why would a boxplot be a suitable graph for these data? Explain briefly.

Solution

In short, one categorical variable variety and one quantitative variable length. Make sure you say which variable is quantitative and which categorical, or else it looks as if you don’t know. Saying a bit more, the categorical variable variety denotes groups that we want to compare for length. (A categorical variable might only identify the observations; in that case, each “group” on a boxplot would only contain one observation!)

\(\blacksquare\)

  1. Draw a boxplot4 of these data.

Solution

This is nothing special apart from getting the right variables in the right places. On a boxplot, the grouping variable is x and the quantitative one is y:

ggplot(heliconia, aes(x = variety, y = length)) + geom_boxplot()

\(\blacksquare\)

  1. What do you learn from your boxplot? Explain briefly.

Solution

There are three things to say about how these distributions compare: centre (here median), spread (IQR) and shape:

  • centre: bihai has clearly the largest median (furthest up the page), and caribaea yellow has the lowest.
  • spread: caribaea red has the largest spread (IQR) because its box is the tallest. The other two distributions have about the same spread.
  • shape: bihai is skewed to the right (longer upper whisker), while the other two distributions are roughly symmetric.

\(\blacksquare\)

  1. An alternative graph in this situation is a set of three histograms next to each other. Draw this graph. Consider your aims in drawing the graph; you will need to think about what “next to each other” most usefully means for you. Hint: problem 7.9(c) in PASIAS.

Solution

The idea of three histograms in one plot is meant to make you think about facets. Your first attempt might look something like this:

ggplot(heliconia, aes(x = length)) + geom_histogram(bins = 6) +
  facet_wrap(~variety)

There are (at least) two problems with this:

  • the point of this plot, like that of the boxplot, is to compare the three distributions. Having the distributions left and right like this makes it difficult to compare centres and spreads; it would be much easier if the graphs were above each other, in one column.
  • each histogram appears to have only two or three bins, even though I specified six bins in the graph. (My choice of six was based on there being about 20 observations in each group, 54 observations divided by 3 groups being an average of 18 per group.) The problem here is that the number of bins is shared among the three histograms; the lengths overall go from about 32 to about 50, so the six bins are about 32–35, 35–38, 38–41, … , 47–50 and they are used on all three graphs. The bihai length values are higher than the others, as we saw on the boxplot, so they use only the highest two of these bins. To fix this, use more bins so that each histogram has about the right number. You can experiment (I found this way that 20 bins was pretty good), or note that if the three sets of lengths didn’t overlap at all, you would multiply the number of groups by the number of bins to get 18. These three groups do overlap a little, so 18 bins might be a little too many.

With all this in mind, the way to get the distributions side by side so that you can compare them is to put them in one column. There are actually two ways to do that. If you like facet_wrap, use that with ncol = 1 to get one column of graphs. You don’t get specific control over the layout with facet_wrap, but the use of ncol to get all the graphs in one column achieves the desired effect:

ggplot(heliconia, aes(x = length)) + geom_histogram(bins = 18) +
  facet_wrap(~variety, ncol = 1)

Note that each histogram now has something like six bins. The second one, for caribaea red, has more because this distribution is more spread out than the others (and all the bins are the same width), but each distribution has at least five bins (in the case of bihai, one of them is empty), which for me is enough to give at least some sense of shape. If it is not for you, use slightly more bins overall, maybe 20.

In the PASIAS question referred to in the hint, I also talk about facet_grid. Use that if it makes more sense to you. The idea there is that you specify the way to arrange the sub-graphs in the plot (the three histograms, in this case). The overall arrangement has an x and a y, distinct from the x and y in the aes() (the ones in aes() refer to each individual graph, rather than the overall arrangement). What you do is to say which (categorical) column refers to y first (before a squiggle) and x second (after the squiggle). If you don’t have anything to go in either slot, you use a dot.

Thus, the facet_grid version of the above graph is this:

ggplot(heliconia, aes(x = length)) + geom_histogram(bins = 18) +
  facet_grid(variety ~ .)

variety is determining how the histograms are arranged up and down, and there is nothing arranging them left and right.

\(\blacksquare\)

Extra discussion:

1(a):

extra 1: If you scroll down, you’ll see that the last rower seems to be much lighter in weight than the others. We get to this later.

extra 2: in this course, there is quite often a story of how the data came from the world to be a nice tidy data file for you. This is one of those cases. I don’t know how much of the story you’ll understand now, but I will do what I can to explain. Some of this comes from “choosing data” and some of it comes from “tidying data”, which is a bit further down the road.

The first bit is what is known as “web scraping”, which is not otherwise in the course, but you might be interested anyway. The data came from Wikipedia, to be precise here.5 Wikipedia articles are nice because the tables in them are easy to scrape once you know how. Web-scraping stuff lives in a package called rvest, and the first bit is not unlike reading data from a file:

library(rvest)
my_url <- "https://en.wikipedia.org/wiki/The_Boat_Race_1992"
html <- read_html(my_url)
html
{html_document}
<html class="client-nojs vector-feature-language-in-header-enabled vector-feature-language-in-main-page-header-disabled vector-feature-sticky-header-disabled vector-feature-page-tools-pinned-disabled vector-feature-toc-pinned-enabled vector-feature-main-menu-pinned-disabled vector-feature-limited-width-clientpref-1 vector-feature-limited-width-content-enabled vector-feature-zebra-design-disabled vector-feature-custom-font-size-clientpref-disabled" lang="en" dir="ltr">
[1] <head>\n<meta http-equiv="Content-Type" content="text/html; charset=UTF-8 ...
[2] <body class="skin-vector skin-vector-search-vue mediawiki ltr sitedir-ltr ...

This reads the entire HTML code of that web page into a variable I called html. This is structured as two things: the header of the web page, containing things like the title, and the body, containing all the rest of it.

The next part turns the HTML into something we can work with in R. This bit says “take the HTML and pull out all the things in it that are tables” (specifically, that are HTML TABLEs). We are going to be making a lengthy pipeline, rather like an extended group-by and summarize:

html %>% html_table(header = FALSE)
[[1]]
# A tibble: 12 × 4
   X1                                  X2                            X3    X4   
   <chr>                               <chr>                         <chr> <chr>
 1 "138th Boat Race"                   "138th Boat Race"             <NA>  <NA> 
 2 "Date"                              "4 April 1992"                <NA>  <NA> 
 3 "Winner"                            "Oxford"                      <NA>  <NA> 
 4 "Margin of victory"                 ".mw-parser-output .frac{whi… <NA>  <NA> 
 5 "Winning time"                      "17 minutes 44 seconds"       <NA>  <NA> 
 6 "Overall record (Cambridge–Oxford)" "69–68"                       <NA>  <NA> 
 7 "Umpire"                            "Roger Stephens(Cambridge)"   <NA>  <NA> 
 8 "Other races"                       "Other races"                 <NA>  <NA> 
 9 "Reserve winner"                    "Goldie"                      <NA>  <NA> 
10 "Women's winner"                    "Cambridge"                   <NA>  <NA> 
11 "← 1991\n1993 →"                    "← 1991\n1993 →"              ← 19… 1993…
12 "← 1991"                            "1993 →"                      <NA>  <NA> 

[[2]]
# A tibble: 1 × 2
  X1     X2    
  <chr>  <chr> 
1 ← 1991 1993 →

[[3]]
# A tibble: 12 × 7
   X1                                   X2         X3    X4    X5    X6    X7   
   <chr>                                <chr>      <chr> <chr> <chr> <chr> <chr>
 1 Seat                                 Oxford     Oxfo… Oxfo… Camb… Camb… Camb…
 2 Seat                                 Name       Coll… Weig… Name  Coll… Weig…
 3 Bow                                  Kingsley … St J… 13 s… Max … Sidn… 13 s…
 4 2                                    Joseph Mi… Univ… 13 s… Nich… Jesus 13 s…
 5 3                                    Boris Mav… Jesus 14 s… Jame… Down… 13 s…
 6 4                                    Hamish Hu… Pemb… 13 s… Dani… Down… 13 s…
 7 5                                    Peter Bri… Oriel 13 s… Dona… Magd… 15 s…
 8 6                                    Calman Ma… Keble 14 s… Davi… St C… 14 s…
 9 7                                    Simon Davy Worc… 12 s… Step… Robi… 13 s…
10 Stroke                               Ian Gardi… St P… 13 s… Dirk… Fitz… 12 s…
11 Cox                                  Elizabeth… Chri… 7 st… Andr… Magd… 7 st…
12 Source:[12](P) – boat club president Source:[1… Sour… Sour… Sour… Sour… Sour…

[[4]]
# A tibble: 7 × 4
  X1                                                           X2    X3    X4   
  <chr>                                                        <chr> <chr> <chr>
1 ".mw-parser-output .navbar{display:inline;font-size:88%;fon… ".mw… ".mw… ".mw…
2 ""                                                           "Oxf… "Oxf… ""   
3 ""                                                           "The… "The… ""   
4 ""                                                           "The… "The… ""   
5 ""                                                           "Wom… "Wom… ""   
6 ""                                                           "The… "The… ""   
7 "Category\n Commons"                                         "Cat… "Cat… "Cat…

There are four tables in the document, and they display as an R list, the details of which you don’t need to concern yourself with now. The table we want is the third one. I explain the header thing (below) in a moment. Let’s grab just the third table (we don’t need the others at all):

html %>% html_table(header = FALSE) %>%
  pluck(3)

Now you see why I said there were no headers: there are actually two rows of headers, one saying which university the rower is from (first Oxford, and then scroll right to see Cambridge), and the second row is what we would normally think of as headers. I didn’t want to be dealing with that, so I let html_table choose some column names for me, the things beginning with X. I can be pretty relaxed about dealing with this, because all I want is the weights, which are in X4 for the Oxford rowers and X7 for the Cambridge ones. So those are the only two columns I need.

I don’t need all the rows, either; the first two are headers, and the last one is notes, so I only want rows 3 through 11. select chooses columns and slice chooses rows by number (these are from the “choosing things” lecture):

html %>% html_table(header = FALSE) %>%
  pluck(3) %>%
  select(X4, X7) %>%
  slice(3:11)

The weights are in funky units (that we will have to deal with shortly), but the immediate problem for making boxplots is that this is the wrong shape. We want one column of weights, and a second column saying which university that rower comes from. This is what pivot_longer makes. This comes from the “tidying data” lecture later, so don’t worry about making sense of it now:

html %>% html_table(header = FALSE) %>%
  pluck(3) %>%
  select(X4, X7) %>%
  slice(3:11) %>%
  pivot_longer(everything(), names_to = "col", values_to = "weight_txt")

There are now 18 rows, one row per person, with that person’s weight, and the text in col saying which column that weight came from.

You can go back and check that the first row here is the first Oxford rower, and the second one is the first Cambridge rower, and so on. Let’s identify the things in col with the proper university next:

html %>% html_table(header = FALSE) %>%
  pluck(3) %>%
  select(X4, X7) %>%
  slice(3:11) %>%
  pivot_longer(everything(), names_to = "col", values_to = "weight_txt") %>%
  mutate(university = ifelse(col == "X4", "Oxford", "Cambridge"))

mutate creates a new column in terms of an old one (from the “choosing things” lecture). Our new column is called university, and the value in it has to be the text Oxford if col is X4 and Cambridge otherwise. To make this happen, use ifelse, which is like =IF in Excel: if the first thing is true, then the result is the second thing; otherwise, the result is the third thing.6 All right, those weights in weight_txt.7 What on earth does “st” mean? If you are in the UK, as the author of this Wikipedia article was, you measure weights of humans in “stones”, with 1 stone being 14 pounds. So the first rower weighs this many pounds:

13*14 + 4
[1] 186

So we now have to get those numbers out for each rower, and then we have to use them to calculate weights in pounds. The extraction is a bit fiddly, using something called “regular expressions”:

html %>% html_table(header = FALSE) %>%
  pluck(3) %>%
  select(X4, X7) %>%
  slice(3:11) %>%
  pivot_longer(everything(), names_to = "col", values_to = "weight_txt") %>%
  mutate(university = ifelse(col == "X4", "Oxford", "Cambridge")) %>%
  separate_wider_regex(weight_txt, c(stone = ".*", " st ", lbs = ".*", " lb"))

In words: take the column weight_txt and divide it up into a column called stone and a column called lbs. The recipe for doing that is the second input to separate_wider_regex. It says:

  • grab any number of any character and call it stone.
  • look for the letters st with a space either side, but don’t call it anything.
  • grab any number of any character and call it lbs.
  • look for the letters lb with a space in front, but don’t call it anything.

It is probably not obvious how this works. The idea is that the regular expression parser finds a way to make the whole thing match, and the only way to match the st with a space either side is to have the first “any number of any character” match only the 13 or 14 before st, and the only way to match the lb at the end as well is to have the second “any number of any character” match the decimal number of pounds in between the text st and the text lb.

The last-but-one thing to say here is that separate_wider_delim makes the column weight_txt disappear, on the basis that after you’ve done the extraction, you usually don’t need the original any more (as here), though there is an option cols_remove = FALSE by which you can keep it if you need to. The last thing to say here is that separate_wider_regex is a more complicated version of separate_wider_delim, which we will see later; we could have used that if the weights had been written like 13:4 for 13 stone 4 lbs, with a predictable something separating each number.

From here, a mutate to calculate the weights in pounds, allowing for the fact that stone and lbs are actually still text, even though they look like numbers, so we need to convert them to numbers before doing arithmetic with them:8

html %>% html_table(header = FALSE) %>%
  pluck(3) %>%
  select(X4, X7) %>%
  slice(3:11) %>%
  pivot_longer(everything(), names_to = "col", values_to = "weight_txt") %>%
  mutate(university = ifelse(col == "X4", "Oxford", "Cambridge")) %>%
  separate_wider_regex(weight_txt, c(stone = ".*", " st ", lbs = ".*", " lb")) %>%
  mutate(weight = 14 * as.numeric(stone) + as.numeric(lbs))

and the final steps are to grab only the columns we need now, and to save it to a file for you. There is a write_delim that is the opposite of read_delim:

html %>% html_table(header = FALSE) %>%
  pluck(3) %>%
  select(X4, X7) %>%
  slice(3:11) %>%
  pivot_longer(everything(), names_to = "col", values_to = "weight_txt") %>%
  mutate(university = ifelse(col == "X4", "Oxford", "Cambridge")) %>%
  separate_wider_regex(weight_txt, c(stone = ".*", " st ", lbs = ".*", " lb")) %>%
  mutate(weight = 14 * as.numeric(stone) + as.numeric(lbs)) %>%
  select(university, weight) %>%
  write_delim("boat_race.txt")

This produces no output because the output gets written to the file, which looks like this:9

cat boat_race.txt
university weight
Oxford 186
Cambridge 188.5
Oxford 184.5
Cambridge 183
Oxford 204
Cambridge 184.5
Oxford 184.5
Cambridge 185
Oxford 195.5
Cambridge 214
Oxford 202.5
Cambridge 203.5
Oxford 174
Cambridge 186
Oxford 183
Cambridge 178.5
Oxford 109.5
Cambridge 109

2(a): Medical people like to code variables as 1 and 0, which can variously denote “present/absent”, “yes/no”, “event/no event”, “abnormal/normal”, or any other two categories. This is convenient for recording (just type one digit into your spreadsheet), but is not so useful for us as statisticians, who have to keep referring back to the data description to find out what 1 and 0 mean in each case.

For each of the categorical variables of interest to us, let’s redefine their values to be something more meaningful than 1 and 0. Hint for you: recall, or find out, how ifelse works in defining new columns.

There is no harm in keeping all the variables, but the ones we want are not very many,10 so I am going to select them first. Also, you can use the same names for the new variables, or you can use new names. I’m going to use new names, at least at first, so that I can compare the new values with the old ones and convince myself that my values are correct. ifelse takes three inputs: something that is true or false, the value to return if true, and the value to return if false. With those things in mind, let’s go:

icu %>%
  select(sta, age, sex, ser, sys, typ) %>%
  mutate(sta1 = ifelse(sta == 0, "lived", "died"),
         sex1 = ifelse(sex == 0, "male", "female"),
         ser1 = ifelse(ser == 0, "medical", "surgical"),
         typ1 = ifelse(typ == 0, "elective", "emergency")) -> icu1
icu1

Coding note: you can have four mutates, one after the other, or you can put the definitions of four new columns into one mutate. The value of using new column names is that you can check the new values against the old ones and make sure they make sense. You’ll have to scroll down a bit to find some patients who died (and thus that their sta value is correct). I think it is better, at least at first, when you are checking that it works, to define new columns, so that you can eyeball them side by side with the original ones. It is up to you whether you then keep the new column names, or do it again overwriting the old ones, but I think you should at least make new column names first to convince yourself that you are doing the right thing.11 In a real report, you might get away with putting the new values back into the original variables without comment, but here, while we are still learning, you ought to say that the recoded and original values do match up in each case and that you checked.

Aside: a more formal way to check that things match up is to count them, for example:

icu1 %>% count(sta, sta1)

and then you can check the data description to be sure that 0 should be “lived” and 1 should be “died”. These all match up. You would know there was trouble if some of the lived had sta 0 and some of them sta 1. End of aside.

Finally, make a tidy version (with the same variable names as before) and save it:

icu1 %>%
  select(sta, age, sex, ser, sys, typ) %>%
  mutate(sta = ifelse(sta == 0, "lived", "died"),
         sex = ifelse(sex == 0, "male", "female"),
         ser = ifelse(ser == 0, "medical", "surgical"),
         typ = ifelse(typ == 0, "elective", "emergency")) -> icu2
icu2
write_csv(icu2, "icudat.csv")

and that’s the one you would actually use for an analysis, such as this:

ggplot(icu2, aes(x = typ, fill = sta)) +
  geom_bar(position = "dodge")

and you see that it is much more obvious how to read this graph, such as noting that very few of the elective patients died, without having to go back and find out what those 0s and 1s mean.

Extra 2: With a big data set like this one, it is perhaps better to inspect the data read in with glimpse:

glimpse(icu)
Rows: 200
Columns: 20
$ sta  <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ age  <dbl> 27, 59, 77, 54, 87, 69, 63, 30, 35, 70, 55, 48, 66, 61, 66, 52, 5…
$ sex  <dbl> 1, 0, 0, 0, 1, 0, 0, 1, 0, 1, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0,…
$ race <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
$ ser  <dbl> 0, 0, 1, 0, 1, 0, 1, 0, 0, 1, 1, 1, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1,…
$ can  <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ crn  <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0,…
$ inf  <dbl> 1, 0, 0, 1, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 1, 0, 0, 1, 0,…
$ cpr  <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0,…
$ sys  <dbl> 142, 112, 100, 142, 110, 110, 104, 144, 108, 138, 188, 162, 160, …
$ hra  <dbl> 88, 80, 70, 103, 154, 132, 66, 110, 60, 103, 86, 100, 80, 99, 90,…
$ pre  <dbl> 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 1, 1, 0, 0,…
$ typ  <dbl> 1, 1, 0, 1, 1, 1, 0, 1, 1, 0, 0, 0, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0,…
$ fra  <dbl> 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ po2  <dbl> 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0,…
$ ph   <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0,…
$ pco  <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0,…
$ bic  <dbl> 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0,…
$ cre  <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0,…
$ loc  <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0,…

This shows you all the column names (among which you can find the ones of interest to us), and a number of values. This makes it clear that a lot of the variables are coded as 1 and 0.

2(d): We seem to have discovered that knowing about the procedure a patient was in the ICU for tells us something about whether they will live or die. The best-known way of testing for this is the chi-squared test for independence. That goes like this (using the dataframe we made with the nice names):

first you make a table of how many observations (patients) were in each category combination:

tab <- with(icu2, table(typ, sta))
tab
           sta
typ         died lived
  elective     2    51
  emergency   38   109

and then the chi-squared test goes like this:

icu.1 <- chisq.test(tab)
icu.1

    Pearson's Chi-squared test with Yates' continuity correction

data:  tab
X-squared = 10.527, df = 1, p-value = 0.001177

We find a significant association between typ and sta: knowing the type of procedure really does tell us about the survival chances.

How? You might remember doing these by hand (especially if you did the psych stats course), in which case you’ll remember that you had to calculate observed and expected frequencies first. The idea is to work out what you’d expect if there is actually no association at all between the two categorical variables. If the observed and expected are too different, then you reject the idea that there is no association and conclude that there was an association after all, which is what happened here.

To get an idea of what kind of association there is, you can look at the observed (above) and the expected (below) and see where they are most different:

icu.1$expected
           sta
typ         died lived
  elective  10.6  42.4
  emergency 29.4 117.6

We were not expecting many deaths among the elective patients, because there aren’t many of them, but we observed even fewer, and there were more deaths than expected among the emergency patients. This would make sense, because emergency cases are things like heart attacks where some patients are not going to survive no matter how skilled the ICU doctor is, while elective cases are things the patient’s doctor has recommended, without the same level of urgency.

3(e): facet_grid can be useful when you have two categorical variables, since then you can make a two-dimensional array of subplots, with one categorical variable going across and the other one going up and down. For example, you could use this if you have two quantitative and two categorical variables; you make the sub-plots scatterplots and then you can see how the trends change if either or both of the categorical variables change.

Footnotes

  1. Not even if they are playing this!↩︎

  2. A way to think about this is to ask yourself whether the numbers have any meaning as numbers. If they were, say, ages, the actual numbers would be meaningful, but here, the 0 and 1 are actually labels for two different categories.↩︎

  3. A factor is R’s term for a variable that must be treated as categorical. Usually the values of a categorical variable are text, like lived or died, and those will properly be treated as categorical, but sometimes a categorical variable is expressed as numeric labels, as here, and then you need the factor to make it clear that the labels have no meaning as numbers.↩︎

  4. It’s up to you whether you call this one boxplot, or one graph containing three side-by-side boxplots. Think about what makes more sense to you, and what will make more sense to your reader.↩︎

  5. It is called “The Boat Race” because, like many British things, it was the first organized rowing race, starting in 1829.↩︎

  6. If you have more than two things to decide between, you might be accustomed to using nested IFs in Excel, which are very hard to read. In R, there is a thing called case_when that is designed for this.↩︎

  7. I used a “disposable” name for these, saving weight for the actual weights in pounds that we are going to calculate.↩︎

  8. Note the chr at the top of each of those columns.↩︎

  9. in the bash shell, which is the one that R Studio’s “terminal” uses, cat is how you display a file. You can also have a bash code chunk, or even a python one.↩︎

  10. When I originally used these data, there were six of them I wanted to keep.↩︎

  11. Bear in mind that if you are not doing the right thing here, all the analysis that you do later will be nonsense, and on an assignment, you are dependent on the good will of the grader to get any credit for it.↩︎