STAC33 Assignment 8

You are expected to complete this assignment on your own: that is, you may discuss general ideas with others, but the writeup of the work must be entirely your own. If your assignment is unreasonably similar to that of another student, you can expect to be asked to explain yourself.

If you run into problems on this assignment, it is up to you to figure out what to do. The only exception is if it is impossible for you to complete this assignment, for example a data file cannot be read. (There is a difference between you not knowing how to do something, which you have to figure out, and something being impossible, which you are allowed to contact me about.)

You must hand in a rendered document that shows your code, the output that the code produces, and your answers to the questions. This should be a file with .html on the end of its name. There is no credit for handing in your unrendered document (ending in .qmd), because the grader cannot then see whether the code in it runs properly. After you have handed in your file, you should be able to see (in Attempts) what file you handed in, and you should make a habit of checking that you did indeed hand in what you intended to, and that it displays as you expect.

Hint: render your document frequently, and solve any problems as they come up, rather than trying to do so at the end (when you may be close to the due date). If your document will not successfully render, it is because of an error in your code that you will have to find and fix. The error message will tell you where the problem is, but it is up to you to sort out what the problem is.

1 Ranking sports

130 people were asked to rank seven sports according to how much they would like to participate in them. A rank of 1 means that the person would most like to participate in that sport, and a rank of 7 means that the person would least like to participate in that sport. The seven sports are baseball, football, basketball, tennis, cycling, swimming, and jogging. The data are in http://ritsokiguess.site/datafiles/sports-ranks.csv.

(a) (1 point) Read in and display (some of) the data.

Surprises? None:

my_url <- "http://ritsokiguess.site/datafiles/sports-ranks.csv"
ranks <- read_csv(my_url)
Rows: 130 Columns: 8
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
dbl (8): person_id, Baseball, Football, Basketball, Tennis, Cycling, Swimmin...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
ranks

(b) (4 points) Which sport was ranked #1 by the most people? Show the code you used to work this out. You will probably want to rearrange the data first. Summarize your code in words with a sentence or bullet point for each line of code.

To start with, a standard pivot-longer. Find a way of selecting all the columns that you want to pivot longer (all the columns that name sports, or all the columns except the person’s ID, the latter being cleaner):

ranks %>% 
  pivot_longer(-person_id, names_to = "sport", values_to = "rank") 

Note that there are now many more rows, and each person is listed 7 times (once for each of the seven ranks). Having all the ranks in one column makes it much easier to count them. This is what I did:

ranks %>% 
  pivot_longer(-person_id, names_to = "sport", values_to = "rank") %>% 
  filter(rank == 1) %>% 
  count(sport) %>% 
  arrange(desc(n))

The sport ranked #1 most often was baseball. (You need to say this in order to answer the question.)

My strategy (in bullet points) was:

  • arrange all the ranks in one column
  • grab only the #1 ranks (since these were the only ones we care about)
  • count how many times each sport appeared (ranked 1)
  • arrange them in order.

Another good alternative is to find only the top one:

ranks %>% 
  pivot_longer(-person_id, names_to = "sport", values_to = "rank") %>% 
  filter(rank == 1) %>% 
  count(sport) %>% 
  slice_max(n, n = 1)

The last line is rather confusing: the first n is the n column that the count produced (as you discovered by running this one line at a time), and the second one is a named input to slice_max, showing how many of the highest values you want to display. Your last bullet point for this one would be something like “display only the largest frequency”.

A rather more clunky way is to do this for all the ranks:

ranks %>% 
  pivot_longer(-person_id, names_to = "sport", values_to = "rank") %>% 
  group_by(rank) %>% 
  count(sport) %>% 
  slice_max(n, n = 1)

and then pick out the results for rank 1, which says that baseball was most often ranked 1. (Jogging was the least popular, by any standard: least often ranked #1, most often ranked #6 and #7.)

Get to an answer of “baseball, ranked #1 by 25 people” somehow by some code, and explanation of that code, that will work for any dataset arranged like this.

2 Case rates of a disease

The dataset in http://ritsokiguess.site/datafiles/tab5.csv contains the number of cases of a disease and the total population of each of three countries in each of two years. The number of cases and the total population are in the same column, separated by a /.

(a) (1 point) Read in and display the data.

As usual:

my_url <- "http://ritsokiguess.site/datafiles/tab5.csv"
tab <- read_csv(my_url)
Rows: 3 Columns: 3
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (3): country, 1999, 2000

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
tab

As promised. We have a column of country names, and one column for the cases-and-populations for each year.

(b) (3 points) Reorganize the data to have a column of countries, a column of years, a column of cases, and a column of populations.

“A column of years” is a big hint to start by putting the cases and populations into one column, by using pivot_longer:

tab %>% 
  pivot_longer(-country, names_to = "year", values_to = "stats")

Now we have to separate the two things in stats (you might have had trouble finding a name for this column). They are separated by a /, so separate_wider_delim is what you need (the two values are “delimited” by a slash):

tab %>% pivot_longer(-country, names_to = "year", values_to = "stats") %>% 
  separate_wider_delim(stats, names = c("cases", "population"), delim = "/")

This is not one of those ones you can do using the first variant of pivot_longer with two things in names_to, because the column name only encodes one thing: the year that the observations were taken. It’s the values that have two things in them, and so the values need to be split up (and you cannot put two things in values_to).

Save this, if you want to, and use the saved dataframe as input to the next part. Or do it as I did. Either way is good.

(c) (3 points) Create a column of case rates (the number of cases per million population). Pay close attention to the types of the columns concerned, and if you need to use anything from outside the lecture notes, cite your source.

Your first thought will probably be something like this:

tab %>% pivot_longer(-country, names_to = "year", values_to = "stats") %>% 
  separate_wider_delim(stats, names = c("cases", "population"), delim = "/") %>% 
  mutate(case_rate = cases / population * 1e6)
Error in `mutate()`:
ℹ In argument: `case_rate = cases/population * 1e+06`.
Caused by error in `cases / population`:
! non-numeric argument to binary operator

“A non-numeric argument to a binary operator” means that when you tried to divide, something that you were dividing was not a number. If you look carefully at the top of the columns, you’ll see they are chr or text, not the numbers they appear to be.

The standard way to convert values that are actually one thing but look like something else is via a function that starts with as., in this case, as.numeric. This is not in the lecture notes, so you need to say where you found out about it. If it’s in PASIAS, which it probably is, give the problem number where you found it. If you saw it in another course, say which course.

There are two ways you could use as.numeric: directly, which involves using it twice:

tab %>% pivot_longer(-country, names_to = "year", values_to = "stats") %>% 
  separate_wider_delim(stats, names = c("cases", "population"), delim = "/") %>% 
  mutate(case_rate = as.numeric(cases) / as.numeric(population) * 1e6)

Or you can convert the cases and population columns to numeric first, using a mutate with an across (or less elegantly with two mutates), and then do an ordinary division:

tab %>% pivot_longer(-country, names_to = "year", values_to = "stats") %>% 
  separate_wider_delim(stats, names = c("cases", "population"), delim = "/") %>% 
  mutate(across(c(cases, population), \(x) as.numeric(x))) %>% 
  mutate(case_rate = cases / population * 1e6)

The third line of my code, in English, reads “take the columns cases and population, and redefine them to be the numeric versions of themselves.”

The 1e6 is scientific notation for a million, \(1 \times 10^6\):

format(1e6, scientific = FALSE)
[1] "1000000"

which is more reliable than remembering to type six zeros every time you use it (and risking getting one of them wrong).

One thing we did learn about in class was parse_number, when we were trying to get numeric dates in the Toronto weather data. The typical use of parse_number is to pull the number out of a thing like d12, but you might be wondering whether you can use it on a piece of text that is an actual number. Try it and see:

tab %>% pivot_longer(-country, names_to = "year", values_to = "stats") %>% 
  separate_wider_delim(stats, names = c("cases", "population"), delim = "/") %>% 
  mutate(cases = parse_number(cases))

It does indeed work (the column cases has become numbers), so you can use parse_number either of the two ways you used as.numeric, that is, either this:

tab %>% pivot_longer(-country, names_to = "year", values_to = "stats") %>% 
  separate_wider_delim(stats, names = c("cases", "population"), delim = "/") %>% 
  mutate(case_rate = parse_number(cases) / parse_number(population) * 1e6)

or this:

tab %>% pivot_longer(-country, names_to = "year", values_to = "stats") %>% 
  separate_wider_delim(stats, names = c("cases", "population"), delim = "/") %>% 
  mutate(across(c(cases, population), \(x) parse_number(x))) %>% 
  mutate(case_rate = cases / population * 1e6)

Get to a column of case rates with those numbers using parse_number (uncited, because it’s in the lecture notes) or as.numeric (citing your source).

(d) (3 points) Rearrange your data to contain columns named for each year, containing the case rate for each country in that year, with one row per country.

This means getting columns whose names are 1999 and 2000, with the values of the case rates in them. Your first thought may well be this:

tab %>% pivot_longer(-country, names_to = "year", values_to = "stats") %>% 
  separate_wider_delim(stats, names = c("cases", "population"), delim = "/") %>% 
  mutate(across(c(cases, population), \(x) parse_number(x))) %>% 
  mutate(case_rate = cases / population * 1e6) %>% 
  pivot_wider(names_from = year, values_from = case_rate)

This looks like one of the examples in “when pivot-wider goes wrong”, with extra missing values that we don’t want. It happened for the same reason as that. Thinking back to the result of (c), the columns named in our pivot-wider are year and case_rate, and the issue is what determines the rows: the different combinations of all the other variables, including cases and population, which have six different values in them. There are therefore six rows still, with each country appearing twice.

Look back to the question: I don’t need the columns cases and population in the final answer,1 so what happens if I get rid of them before pivoting wider?

tab %>% pivot_longer(-country, names_to = "year", values_to = "stats") %>% 
  separate_wider_delim(stats, names = c("cases", "population"), delim = "/") %>% 
  mutate(across(c(cases, population), \(x) parse_number(x))) %>% 
  mutate(case_rate = cases / population * 1e6) %>% 
  select(country, year, case_rate) %>% 
  pivot_wider(names_from = year, values_from = case_rate)

You can select the columns you want to keep, or use minus signs inside the select before the names of the columns you want to get rid of: select(-cases, -population).

That worked. It worked because the only column not now named in the pivot-wider was country; there are only three different countries, so we get only three different rows. Find a way to get to this last dataframe, and show your code so that we know how you got there. (Make sure your code uses only ideas from lecture plus possibly as.numeric that you cite a source for.) I am happy if you show all your code plus the result for each part, but you should think for yourself about why your code works as well, and some words of explanation will help to convince the grader that you know what you are doing.

Extra 1: These are actually tuberculosis cases, probably from the same source as our tuberculosis data in lecture, but cut down to three countries and two years so that you can see what’s going on. Now, we can see that the case rates increased dramatically in Afghanistan and Brazil, but remained steady in China.

Extra 2: Note that we ended up with the same layout of data at the end as we had at the beginning (one column for each year), but to get these numbers, we went via longer data (to do the calculations) and then made it wider again at the end (for looking at). This is common. You see also that the calculation part was a lot simpler with all the cases and populations in one column; otherwise, we would have had to do everything for each year, and it would have gotten very repetitive (and prone to error). Imagine if you had 10 or 20 years instead of only two.

Footnotes

  1. I didn’t say to exclude them, but I didn’t say you had to include them, either.↩︎