Worksheet 10

Published

November 11, 2023

Questions are below. My solutions follow all the parts of each question; scroll down if you get stuck. Below that there is extra discussion for some of the questions, which you might find interesting to read, perhaps after tutorial.

For these worksheets, you will learn the most by spending a few minutes thinking about how you would answer each question before you look at my solution. There are no grades attached to these worksheets, so feel free to guess: it makes no difference at all how wrong your initial guess is!

1 Admissions to graduate departments at Berkeley

The University of California Berkeley is a very prestigious institution. Only a small fraction of students that apply to study there are accepted. One year, there was concern that females were being discriminated against in their applications, so the number of males and females admitted and rejected by six different graduate departments (labelled A through F) was recorded. The data are in http://ritsokiguess.site/datafiles/berkeley.txt, but are not very tidy. The data values are separated by tabs.

  1. Read in and display the data. Describe briefly what you see, and what a tidy version of the data would look like.

  2. Tidy the data by pivoting longer and then sorting out the gender and admission status.

  3. Do the same tidying using only one pivot-longer.

  4. Starting again from the dataframe you read in from the file, how can you arrange it to have two columns of frequencies, one with the number of students admitted, and one with the number of students rejected? (There are two ways, one of which is made by fixing up the dataframe of the previous part, and the other of which is directly made with one pivot-longer.) Save the dataframe that comes out of this.

  5. The rest of the way, we investigate the possible gender discrimination, using the tidy dataframe we made in (d). (You may wish to skip the rest of this question on first reading and come back to it later.) First, work out the overall proportion of males and females that were admitted out of all the applicants.

  6. Repeat the previous part, but for each department separately. (The modification to your code is a very small one). Are your results from (e) and (f) consistent, or not?

Admissions to graduate departments at Berkeley: my solutions

The University of California Berkeley is a very prestigious institution. Only a small fraction of students that apply to study there are accepted. One year, there was concern that females were being discriminated against in their applications, so the number of males and females admitted and rejected by six different graduate departments (labelled A through F) was recorded. The data are in http://ritsokiguess.site/datafiles/berkeley.txt, but are not very tidy. The data values are separated by tabs.

(a) Read in and display the data. Describe briefly what you see, and what a tidy version of the data would look like.

This is read_tsv, like the athletes data:

my_url <- "http://ritsokiguess.site/datafiles/berkeley.txt"
admissions <- read_tsv(my_url)
Rows: 6 Columns: 5
── Column specification ────────────────────────────────────────────────────────
Delimiter: "\t"
chr (1): Dept
dbl (4): Males:admitted, Males:rejected, Females:admitted, Females:rejected

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
admissions

We have a column labelling the departments, and four columns of frequencies showing the numbers of males and females admitted and rejected by each of those departments. These are not tidy because each frequency is really one observation, so we would like to get a column of frequencies with other columns saying what department, gender and admission status that frequency goes with.

(As it turns out, we will find that a different form of “tidy” will be easier to address discrimination with.)

I suspect these values came out separated by tabs because I copied and pasted them from a website.

(b) Tidy the data by pivoting longer and then sorting out the gender and admission status.

The usual strategy with these is to do the pivot-longer first and see what you have:

admissions %>% 
  pivot_longer(-Dept, names_to = "gen_admit", values_to = "freq")

That’s an improvement, with the frequencies all in one column, but gender and admission status are glued together and so we need to separate them out. They are separated by a particular character (a colon), so separate_wider_delim will split them up:

admissions %>% 
  pivot_longer(-Dept, names_to = "gen_admit", values_to = "freq") %>% 
  separate_wider_delim(gen_admit, ":", names = c("gender", "admission"))

and now we have something tidy.

(c) Do the same tidying using only one pivot-longer.

This is one of the variations of pivot_longer. Here’s where we are starting from:

admissions

The column names encode two things, separated by a colon, so there are really two names-to, which we specify thus:

admissions %>% 
  pivot_longer(-Dept, names_to = c("gender", "admission"), 
               names_sep = ":", values_to = "freq") 

Note that when you have more than one names_to, you also need to say what they are separated by, in names_sep. My habit is to specify all the names stuff first, and then get to the values stuff.

This is actually tidy now, and so the usual thing is to save this one, but the dataframe we get in the next part turns out to be easier for us to use, and so we will save that one instead.

(d) Starting again from the dataframe you read in from the file, how can you arrange it to have two columns of frequencies, one with the number of students admitted, and one with the number of students rejected? (There are two ways, one of which is made by fixing up the dataframe of the previous part, and the other of which is directly made with one pivot-longer.) Save the dataframe that comes out of this.

For the first way, re-do the previous part:

admissions %>% 
  pivot_longer(-Dept, names_to = c("gender", "admission"), 
               names_sep = ":", values_to = "freq") 

and you see that we can get where we want to be by pivoting-wider the admission column, carrying along the frequencies:

admissions %>% 
  pivot_longer(-Dept, names_to = c("gender", "admission"), 
               names_sep = ":", values_to = "freq") %>% 
  pivot_wider(names_from = admission, values_from = freq)

You are (I hope) feeling a little uneasy about pivoting longer and then immediately wider again (and, indeed, pivoting wider one of the same things we pivoted longer). Surely there must be a way of doing it in one step. There is, but it is not obvious. Return to the data we read from the file:

admissions

and see that we want to do some kind of pivot-longer, with the males and females going into a column called gender, but we want to use admitted and rejected as column names, which is different. This last is accomplished by the special code .value in names_to:

admissions %>% 
  pivot_longer(-Dept, names_to = c("gender", ".value"), names_sep = ":",
               values_to = "freq") -> ar
ar

in one step! (I have to admit that I feel like a genius when I do this and it comes out right!)

I saved this one. I called it ar, to stand for “admitted and rejected”. No doubt you have a better name.

(e) The rest of the way, we investigate the possible gender discrimination, using the tidy dataframe we made in (d). (You may wish to skip the rest of this question on first reading and come back to it later.) First, work out the overall proportion of males and females that were admitted out of all the applicants.

This works better with the dataframe I called ar, because you can see how to work out proportions: as admitted divided by admitted plus rejected. A good strategy is to do that at the end, after totalling up the counts you want to make proportions out of. That goes in two steps. First, sum up the applicants admitted and rejected by gender:

ar %>% 
  group_by(gender) %>% 
  summarize(admitted_total = sum(admitted), rejected_total = sum(rejected)) 

and then, having done that, turn these into proportions:

ar %>% 
  group_by(gender) %>% 
  summarize(admitted_total = sum(admitted), rejected_total = sum(rejected)) %>% 
  mutate(prop = admitted_total / (admitted_total + rejected_total))

Here is where the discrimination claim came from: 45% of males were admitted, but only 30% of females. That seems like a big difference, and it appears that something is going on.

(f) Repeat the previous part, but for each department separately. (The modification to your code is a very small one). Are your results from (e) and (f) consistent, or not?

The modification is to add Dept to the group_by. As to where to add it: we want the proportions of males and females admitted within each department, so Dept goes into the group_by first, thus:

ar %>% 
  group_by(Dept, gender) %>% 
  summarize(admitted_total = sum(admitted), rejected_total = sum(rejected)) %>% 
  mutate(prop = admitted_total / (admitted_total + rejected_total))
`summarise()` has grouped output by 'Dept'. You can override using the
`.groups` argument.

Remember that, overall, males were quite a bit more likely to be admitted than females, and we were worried about discrimination. But here, in four out of the six departments, females were more likely to be admitted than males, and in the other two departments, males have only a small edge over females. This seems entirely inconsistent with the overall picture: how can males come out ahead overall, but females come out ahead department by department? How is this even possible?

This is actually Simpson’s Paradox: by aggregating (summing up over all departments), males come out better, but by looking at departments separately, the females come out better. It seems as if it should be impossible.
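To see that this really can happen, here is a toy sketch with made-up numbers (not the Berkeley data): females do better within each of two fictitious departments, but males do better overall, because the males mostly applied to the easy department:

```r
library(tidyverse)

# made-up counts: dept X admits most applicants, dept Y admits few
toy <- tribble(
  ~dept, ~gender,  ~admitted, ~rejected,
  "X",   "Male",   80,        20,   # within X: males 80% admitted
  "X",   "Female",  9,         1,   # within X: females 90% admitted
  "Y",   "Male",    2,         8,   # within Y: males 20% admitted
  "Y",   "Female", 27,        63    # within Y: females 30% admitted
)

# aggregated over departments, males come out ahead (about 75% vs 36%)
toy %>% 
  group_by(gender) %>% 
  summarize(prop = sum(admitted) / sum(admitted + rejected))
```

Females are ahead by ten percentage points in each department, yet behind by almost forty points overall, purely because of where the applications went.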

There are actually two parts to the explanation:

First, the above shows that the departments differ considerably in how difficult it is to get admitted. Department A accepts most people that apply there, but Department F accepts almost no-one (let me guess: I bet it’s the medical school). So you should be feeling uneasy about combining these very different departments.

Second, let’s work out how many total applications there were by males and females to each department:

ar %>% 
  mutate(total_applications = admitted + rejected) %>% 
  select(Dept, gender, total_applications) %>% 
  pivot_wider(names_from = gender, values_from = total_applications)

(code note: the reason for the pivot-wider was to put males and females side by side for each department, and the reason for the select before that was to make the pivot-wider work properly.)

A lot more males than females applied to departments A and B, where most of the applicants were admitted. But more females than males applied to departments C and E, where it was harder to be admitted.

In other words, the reason that more males than females got admitted overall is that they tended to apply to departments where more people got admitted! Once you look at each department individually (rather than combining over the very different departments), the discrimination completely goes away.

Another statistical way of thinking about this is that an applicant’s chance of getting admitted depends (possibly) on whether they are male or female, but also on what department they are applying to, and you need to consider both of these to come to a justifiable conclusion.
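This can be made concrete as a weighted average (a sketch using the ar dataframe from (d)): each gender's overall admission rate is the average of the department admission rates, weighted by how many applicants of that gender each department received. It is the weights, not the rates, that differ most between the genders:

```r
ar %>% 
  mutate(applicants = admitted + rejected,    # total applications per dept and gender
         dept_rate = admitted / applicants) %>% 
  group_by(gender) %>% 
  summarize(overall_rate = weighted.mean(dept_rate, applicants))
```

This reproduces the overall proportions from (e), since a weighted mean of admitted/applicants with weights applicants is just total admitted over total applicants.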

2 House prices revisited

Earlier, we rearranged data on house prices from multiple columns into the one column of prices we needed to make a boxplot. In this question, we go the other way: starting from the dataframe used to make the boxplot, can we turn it into the dataframe that we read in from the file? The original “longer” data, or something very like it, is in http://ritsokiguess.site/datafiles/homes_long.csv. I suggest you try and see how far you can get (hints: pivot_wider, organizing rows) before looking at what I did.

  1. Read in and display (some of) the data.

  2. Rearrange the data to make it look something like what you read in in the earlier question (that is, with separate columns for 3-bedroom houses and 4-bedroom houses, each column containing prices for houses with that many bedrooms).

House prices revisited: My solutions

In an earlier question, we rearranged data on house prices from multiple columns into the one column of prices we needed to make a boxplot. In this question, we go the other way: starting from the dataframe used to make the boxplot, can we turn it into the dataframe that we read in from the file? The original “longer” data, or something very like it, is in http://ritsokiguess.site/datafiles/homes_long.csv. I suggest you try and see how far you can get (hints: pivot_wider, organizing rows) before looking at what I did.

  1. Read in and display (some of) the data.

Solution

Let’s start with our “longer” dataframe in the file:

my_url <- "http://ritsokiguess.site/datafiles/homes_long.csv"
asking_longer <- read_csv(my_url)
Rows: 37 Columns: 2
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (1): bedrooms
dbl (1): price

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
asking_longer

That’s about where we finished before, and that would be a suitable dataframe for making a boxplot or running a two-sample \(t\)-test. The two columns are called price and bedrooms.

(I changed things around a little; the bedrooms column is called something more memorable, and the values in it are text rather than numbers.)
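As a quick check that this layout really is analysis-ready, a two-sample \(t\)-test can be fed the long dataframe directly via a model formula (a sketch; output not shown):

```r
# compare price between the two groups defined by bedrooms
t.test(price ~ bedrooms, data = asking_longer)
```

The formula price ~ bedrooms says “compare price between the groups defined by bedrooms”, which is exactly the shape the long format gives us.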

  2. Rearrange the data to make it look something like what you read in in the earlier question (that is, with separate columns for 3-bedroom houses and 4-bedroom houses, each column containing prices for houses with that many bedrooms).

Solution

So, turning it from longer to wider is pivot_wider. What about this, then?

asking_longer %>% 
  pivot_wider(names_from = bedrooms, values_from = price)
Warning: Values from `price` are not uniquely identified; output will contain list-cols.
• Use `values_fn = list` to suppress this warning.
• Use `values_fn = {summary_fun}` to summarise duplicates.
• Use the following dplyr code to identify duplicates.
  {data} %>%
  dplyr::group_by(bedrooms) %>%
  dplyr::summarise(n = dplyr::n(), .groups = "drop") %>%
  dplyr::filter(n > 1L)

Well, that did… something. But it is instructive to work out what happened, so that we can take a shot at fixing it. It looks as if the prices of the 14 4-bedroom houses did indeed end up in beds4, and the prices of the 23 3-bedroom houses did indeed end up under beds3. So the columns are all right. But pivot_wider has gone wrong in that there are not enough rows. (According to the hint I just gave, “when pivot-wider goes wrong” is the place to look for ideas.)

The warning message tries to help us: “the values from price are not uniquely identified” means that pivot_wider cannot determine exactly where to put each price value. If we were not expecting this, the code at the bottom of the warning will help us diagnose. But see the next paragraph.

The problem is that there is nothing in asking_longer that says what row anything goes in, with the result that everything gets smooshed into one row, making two list-columns. So maybe the idea is to make a column of row numbers first, before we pivot wider. row_number() is handy for this:

asking_longer %>% 
  mutate(row = row_number()) %>% 
  pivot_wider(names_from = bedrooms, values_from = price)

Well, we have prices in separate cells now, but we seem to have gained some missing values. Where from? Well, each row in asking_longer is either the asking price of a 3-bed house or a 4-bed house but not both, so that each row of the result has the actual price that was in that row under the number of bedrooms it was for, and a missing for the other one. Worse, these are not the kind of missings that drop_na will help us with, since rows with missings also contain data that we want to keep.1

The idea that turns out to work involves some more careful thinking about row numbers. The problem above is that we need separate row numbers for each number of bedrooms, so that the idea we just had is nearly right but not quite. We want to do something (create row numbers) separately for each number of bedrooms, and it turns out that the multi-talented group_by is the key:

asking_longer %>% 
  group_by(bedrooms) %>% 
  mutate(row = row_number()) 

Now you see that the rows have been numbered within each number of bedrooms, so that the first row numbers are repeated for both numbers of bedrooms, and the last few only appear once (after we run out of 4-bedroom houses).

Now, the column called row is the only column not named in the pivot_wider, so it will be used to decide which row each asking price goes in:

asking_longer %>% 
  group_by(bedrooms) %>% 
  mutate(row = row_number()) %>% 
  pivot_wider(names_from = bedrooms, values_from = price)

and now it works. The missings in beds4 are there because there were only 14 4-bedroom houses, and so there are no values to go in rows 15 and on of that column. And, as you’ll recall, each column of a dataframe has to have the same number of rows.

3 Tidying basketball data

The data in http://ritsokiguess.site/datafiles/bb.txt contain some recent basketball results. The data, all in one column, are, for each game, the date the game was played, the road team’s name and score (separated by an underscore), and the home team’s name and score (also separated by an underscore). Our ultimate aim is to create a dataframe with one row for each game, and five columns: the date the game was played, the name of the road team, the road team’s score, the name of the home team, and the home team’s score. In this question, we build a pipeline to accomplish this, each part (except where stated) building on the part before.

  1. Read the data into a one-column dataframe (using, for example, read_csv), without using the first line as column names, and take a look at some of what you read in.

  2. We need to set things up to use pivot_wider. In particular, we need to do two things: say which column each entry goes to, and which row it goes to. To set this up, use c to make a vector called my_cols containing the entries date, road and home, and display it. (This does not use (a).)

  3. The function rep repeats things. It has up to four inputs (you normally use only the first one and one of the other three): the thing to be repeated, the number of times to repeat it (called times), the length of the final result (called length.out), and the number of times to repeat each element of the input (called each). Using your vector my_cols, experiment with each one of the three optional inputs to see what they do. Also, experiment with specifying both times and each.

  4. Starting from the dataframe read in from the file, create a new column what that uses rep on your vector my_cols to repeat the latter enough times to be the same length as X1. Check that the dates, road teams and home teams match up. (This uses the results from (a) and (c).)

  5. Run pivot_wider to create three new columns called date, road and home. What has happened? Why?

  6. Use rep again to create another column that is the numbers 1, 2, 3,… each repeated three times. This will identify each game.

  7. Does pivot_wider work now?

  8. Use the appropriate kind of separate to create columns of road and home teams and scores.

Tidying basketball data: my solutions

The data in http://ritsokiguess.site/datafiles/bb.txt contain some recent basketball results. The data, all in one column, are, for each game, the date the game was played, the road team’s name and score (separated by an underscore), and the home team’s name and score (also separated by an underscore). Our ultimate aim is to create a dataframe with one row for each game, and five columns: the date the game was played, the name of the road team, the road team’s score, the name of the home team, and the home team’s score. In this question, we build a pipeline to accomplish this, each part (except where stated) building on the part before.

(a) Read the data into a one-column dataframe (using, for example, read_csv), without using the first line as column names, and take a look at some of what you read in.

There is only one column, so you can pretend that the “columns” are separated by anything, like a comma. To not read the first row as the name of the first column, set col_names to be FALSE:

my_url <- "http://ritsokiguess.site/datafiles/bb.txt"
nba <- read_csv(my_url, col_names = FALSE)
Rows: 48 Columns: 1
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (1): X1

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
nba

The one column gets the name X1. There are 48 rows, which means there are this many games:

48/3
[1] 16

and you can see that the entries in X1 are a date, a team and its score, another team and its score, repeated.

(b) We need to set things up to use pivot_wider. In particular, we need to do two things: say which column each entry goes to, and which row it goes to. To set this up, use c to make a vector called my_cols containing the entries date, road and home, and display it. (This does not use (a).)

Just this, for the moment:

my_cols <- c("date", "road", "home")
my_cols
[1] "date" "road" "home"

(c) The function rep repeats things. It has up to four inputs (you normally use only the first one and one of the other three): the thing to be repeated, the number of times to repeat it (called times), the length of the final result (called length.out), and the number of times to repeat each element of the input (called each). Using your vector my_cols, experiment with each one of the three optional inputs to see what they do. Also, experiment with specifying both times and each.

Here are some examples of what you might try (your numbers don’t have to be the same as mine):

rep(my_cols, times = 2)
[1] "date" "road" "home" "date" "road" "home"

The whole thing my_cols is repeated twice.

rep(my_cols, length.out = 12)
 [1] "date" "road" "home" "date" "road" "home" "date" "road" "home" "date"
[11] "road" "home"

my_cols is repeated enough times to make a total length of 12 (that is, four times).

rep(my_cols, each = 4)
 [1] "date" "date" "date" "date" "road" "road" "road" "road" "home" "home"
[11] "home" "home"

This repeats each thing in my_cols four times before moving on to the next one. If you combine times and each, this is what happens:

rep(my_cols, times = 2, each = 3)
 [1] "date" "date" "date" "road" "road" "road" "home" "home" "home" "date"
[11] "date" "date" "road" "road" "road" "home" "home" "home"

This repeats each element of my_cols three times (giving the first nine values), and then does that whole thing twice, giving 18 values in all. That has the same effect as this:

rep(my_cols, length.out = 18, each = 3)
 [1] "date" "date" "date" "road" "road" "road" "home" "home" "home" "date"
[11] "date" "date" "road" "road" "road" "home" "home" "home"

that is to say, it repeats each value three times, and then repeats the whole thing enough times to give a total length of 18.

(d) Starting from the dataframe read in from the file, create a new column what that uses rep on your vector my_cols to repeat the latter enough times to be the same length as X1. Check that the dates, road teams and home teams match up. (This uses the results from (a) and (c).)

We know there are 48 rows, so it is easiest to use length.out thus:

nba %>% 
  mutate(what = rep(my_cols, length.out = 48))

These do indeed match what they are.

Alternatively, if you did that division sum to work out how many games there were, use times:

nba %>% 
  mutate(what = rep(my_cols, times = 16))

If you get the wrong value in length.out or times, this kind of thing will happen:

nba %>% 
  mutate(what = rep(my_cols, times = 10))
Error in `mutate()`:
ℹ In argument: `what = rep(my_cols, times = 10)`.
Caused by error:
! `what` must be size 48 or 1, not 30.

This comes back to the fact that all the columns of a dataframe must have the same length: my_cols repeated 10 times has only 30 values in it, not the 48 it needs to match up with X1.

(e) Run pivot_wider to create three new columns called date, road and home. What has happened? Why?

The inevitable try-it-and-see:

nba %>% 
  mutate(what = rep(my_cols, length.out = 48)) %>% 
  pivot_wider(names_from = what, values_from = X1)
Warning: Values from `X1` are not uniquely identified; output will contain list-cols.
• Use `values_fn = list` to suppress this warning.
• Use `values_fn = {summary_fun}` to summarise duplicates.
• Use the following dplyr code to identify duplicates.
  {data} %>%
  dplyr::group_by(what) %>%
  dplyr::summarise(n = dplyr::n(), .groups = "drop") %>%
  dplyr::filter(n > 1L)

We have three list-columns of length 16, and only one row: the characteristic sign that too many values were competing for each cell. This happened because there was nothing to say which row each piece of game info should go into. We fix that in the next part.

(f) Use rep again to create another column that is the numbers 1, 2, 3,… each repeated three times. This will identify each game.

This is each: each = 3 to repeat each number three times. How far up to go? Well, either note again that there are 16 games, or that the whole thing must be 48 long and each number is repeated three times, so it must go up to 16:

nba %>% 
  mutate(what = rep(my_cols, times = 16)) %>% 
  mutate(my_row = rep(1:16, each = 3))

(g) Does pivot_wider work now?

You know what to do now. This is literally two copy-and-pastes:

nba %>% 
  mutate(what = rep(my_cols, times = 16)) %>% 
  mutate(my_row = rep(1:16, each = 3)) %>% 
  pivot_wider(names_from = what, values_from = X1)

This looks a whole lot better.

(h) Use the appropriate kind of separate to create columns of road and home teams and scores.

The things we want to split up are separated by a “delimiter character”, namely an underscore, so separate_wider_delim is what we want. Each thing to be split up has exactly one underscore in it, so there is no funny business here at all. Try one to make sure it works:

nba %>% 
  mutate(what = rep(my_cols, times = 16)) %>% 
  mutate(my_row = rep(1:16, each = 3)) %>% 
  pivot_wider(names_from = what, values_from = X1) %>% 
  separate_wider_delim(road, "_", names = c("road_team", "road_score"))

and then do the other one the same way:

nba %>% 
  mutate(what = rep(my_cols, times = 16)) %>% 
  mutate(my_row = rep(1:16, each = 3)) %>% 
  pivot_wider(names_from = what, values_from = X1) %>% 
  separate_wider_delim(road, "_", names = c("road_team", "road_score")) %>% 
  separate_wider_delim(home, "_", names = c("home_team", "home_score")) 

and we seem to have done it. (Maybe not quite: see the Extras.)

Extra 1: the two team scores are still text rather than numbers, which we can fix like this:

nba %>% 
  mutate(what = rep(my_cols, times = 16)) %>% 
  mutate(my_row = rep(1:16, each = 3)) %>% 
  pivot_wider(names_from = what, values_from = X1) %>% 
  separate_wider_delim(road, "_", names = c("road_team", "road_score")) %>% 
  separate_wider_delim(home, "_", names = c("home_team", "home_score")) %>% 
  mutate(road_score = as.numeric(road_score)) %>% 
  mutate(home_score = as.numeric(home_score))

Extra 2: we should probably also fix the dates. This is not quite as simple as the one in lecture, because these are not year-month-day. To fix these, we use something from the lubridate package. This can handle dates in a lot of different formats. In particular, there are functions with y, m and d in some order, and the one you use depends on how your dates are arranged: here, it is month, day, year, so we need to use mdy:2

nba %>% 
  mutate(what = rep(my_cols, times = 16)) %>% 
  mutate(my_row = rep(1:16, each = 3)) %>% 
  pivot_wider(names_from = what, values_from = X1) %>% 
  separate_wider_delim(road, "_", names = c("road_team", "road_score")) %>% 
  separate_wider_delim(home, "_", names = c("home_team", "home_score")) %>% 
  mutate(road_score = as.numeric(road_score)) %>% 
  mutate(home_score = as.numeric(home_score)) %>% 
  mutate(date = mdy(date)) -> nba_tidy
nba_tidy

You could now get rid of my_row, which has served its purpose.
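For example, something like this (overwriting nba_tidy, if you are happy to do that):

```r
# drop the helper column now that the pivot is done
nba_tidy %>% 
  select(-my_row) -> nba_tidy
nba_tidy
```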

Extra 3: now we can do stuff like making a graph of the road score vs the home score:

ggplot(nba_tidy, aes(x = road_score, y = home_score)) + geom_point()

There seems, apart from those two points bottom right,3 to be mostly an upward trend, which you might find surprising (or not).

Or, you can find the highest score by a home team, and the game it was in:

nba_tidy %>% slice_max(home_score, n = 1)

Both of these, you will note, depend on our having turned the scores into numbers.

Footnotes

  1. The expression is “throwing the baby out with the bathwater”, which contains the rather ghoulish image of what happens when you have given the baby a bath and you toss out the dirty water afterwards not very carefully. But I am writing this on Halloween, so I am allowed to be ghoulish.↩︎

  2. Which, you will note, also handles the named months.↩︎

  3. These were big wins by the road team.↩︎