Questions are below. My solutions are below all the question parts for a question; scroll down if you get stuck. There is extra discussion below that for some of the questions; you might find that interesting to read, maybe after tutorial.
For these worksheets, you will learn the most by spending a few minutes thinking about how you would answer each question before you look at my solution. There are no grades attached to these worksheets, so feel free to guess: it makes no difference at all how wrong your initial guess is!
1 Admissions to graduate departments at Berkeley
The University of California Berkeley is a very prestigious institution. Only a small fraction of students that apply to study there are accepted. One year, there was concern that females were being discriminated against in their applications, so the number of males and females admitted and rejected by six different graduate departments (labelled A through F) was recorded. The data are in http://ritsokiguess.site/datafiles/berkeley.txt, but are not very tidy. The data values are separated by tabs.
(a) Read in and display the data. Describe briefly what you see, and what a tidy version of the data would look like.
(b) Tidy the data by pivoting longer and then sorting out the gender and admission status.
(c) Do the same tidying using only one pivot-longer.
(d) Starting again from the dataframe you read in from the file, how can you arrange it to have two columns of frequencies, one with the number of students admitted, and one with the number of students rejected? (There are two ways, one of which is made by fixing up the dataframe of the previous part, and the other of which is directly made with one pivot-longer.) Save the dataframe that comes out of this.
(e) The rest of the way, we investigate the possible gender discrimination, using the tidy dataframe we made in (d). (You may wish to skip the rest of this question on first reading and come back to it later.) First, work out the overall proportion of males and females that were admitted out of all the applicants.
(f) Repeat the previous part, but for each department separately. (The modification to your code is a very small one). Are your results from (e) and (f) consistent, or not?
Admissions to graduate departments at Berkeley: my solutions
The University of California Berkeley is a very prestigious institution. Only a small fraction of students that apply to study there are accepted. One year, there was concern that females were being discriminated against in their applications, so the number of males and females admitted and rejected by six different graduate departments (labelled A through F) was recorded. The data are in http://ritsokiguess.site/datafiles/berkeley.txt, but are not very tidy. The data values are separated by tabs.
(a) Read in and display the data. Describe briefly what you see, and what a tidy version of the data would look like.
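A minimal sketch of the reading-in, assuming the tab-separated values described above (read_delim with delim = "\t" would do the same job; the name admissions matches the display below):

admissions <- read_tsv("http://ritsokiguess.site/datafiles/berkeley.txt")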
Rows: 6 Columns: 5
── Column specification ────────────────────────────────────────────────────────
Delimiter: "\t"
chr (1): Dept
dbl (4): Males:admitted, Males:rejected, Females:admitted, Females:rejected
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
admissions
We have a column labelling the departments, and four columns of frequencies showing the numbers of males and females admitted and rejected by each of those departments. These are not tidy because each frequency is really one observation, so we would like to get a column of frequencies with other columns saying what department, gender and admission status that frequency goes with.
(As it turns out, we will find that a different form of “tidy” will be easier to address discrimination with.)
I suspect these values came out separated by tabs because I copied and pasted them from a website.
(b) Tidy the data by pivoting longer and then sorting out the gender and admission status.
The usual with these is to do the pivot-longer first and see what you have then:
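A sketch of that first step (the new column names gender_admit and frequency are my guesses at suitable names):

admissions %>%
  pivot_longer(-Dept, names_to = "gender_admit", values_to = "frequency")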
That’s an improvement, with the frequencies all in one column, but gender and admission status are glued together and so we need to separate them out. They are separated by a particular character (a colon), so separate_wider_delim will split them up:
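Continuing the sketch, splitting that glued-together column at the colon (the names gender and admit for the two new columns are again my choices):

admissions %>%
  pivot_longer(-Dept, names_to = "gender_admit", values_to = "frequency") %>%
  separate_wider_delim(gender_admit, delim = ":", names = c("gender", "admit"))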
(c) Do the same tidying using only one pivot-longer.
pivot_longer can do both jobs at once if you give it two names in names_to. Note that when you have more than one names_to, you also need to say what they are separated by, in names_sep. My habit is to specify all the names stuff first, and then get to the values stuff.
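A sketch of the one-step version (column names again my choices):

admissions %>%
  pivot_longer(-Dept, names_to = c("gender", "admit"),
               names_sep = ":", values_to = "frequency")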
This is actually tidy now, and so the usual thing is to save this one, but the dataframe we get in the next part turns out to be easier for us to use, and so we will save that one instead.
(d) Starting again from the dataframe you read in from the file, how can you arrange it to have two columns of frequencies, one with the number of students admitted, and one with the number of students rejected? (There are two ways, one of which is made by fixing up the dataframe of the previous part, and the other of which is directly made with one pivot-longer.) Save the dataframe that comes out of this.
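The first of the two ways, sketched with the column names from my sketches above: take the tidy dataframe of the previous part and pivot the admission status back wider.

admissions %>%
  pivot_longer(-Dept, names_to = c("gender", "admit"),
               names_sep = ":", values_to = "frequency") %>%
  pivot_wider(names_from = admit, values_from = frequency)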
You are (I hope) feeling a little uneasy about pivoting longer and then immediately wider again (and, indeed, pivoting wider one of the same things we pivoted longer). Surely there must be a way of doing it in one step. There is, but it is not obvious. Return to the data we read from the file:
admissions
and see that we want to do some kind of pivot-longer, with the males and females going into a column called gender, but we want to use admitted and rejected as column names, which is different. This last is accomplished by the special code .value in names_to:
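Something like this sketch, saving the result under the name ar used below:

ar <- admissions %>%
  # .value: this piece of the old column name becomes the name of a new column
  pivot_longer(-Dept, names_to = c("gender", ".value"), names_sep = ":")
ar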
in one step! (I have to admit that I feel like a genius when I do this and it comes out right!)
I saved this one. I called it ar, to stand for “admitted and rejected”. No doubt you have a better name.
(e) The rest of the way, we investigate the possible gender discrimination, using the tidy dataframe we made in (d). (You may wish to skip the rest of this question on first reading and come back to it later.) First, work out the overall proportion of males and females that were admitted out of all the applicants.
This works better with the dataframe I called ar, because you can see how to work out proportions: as admitted divided by admitted plus rejected. A good strategy is to do that at the end, after totalling up the counts you want to make proportions out of. That goes in two steps. First, sum up the applicants admitted and rejected by gender:
ar %>%
  group_by(gender) %>%
  summarize(admitted_total = sum(admitted),
            rejected_total = sum(rejected))
and then, having done that, turn these into proportions:
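Continuing with a mutate on the end (prop_admitted is my name for the new column):

ar %>%
  group_by(gender) %>%
  summarize(admitted_total = sum(admitted),
            rejected_total = sum(rejected)) %>%
  mutate(prop_admitted = admitted_total / (admitted_total + rejected_total))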
Here is where the discrimination claim came from: 45% of males were admitted, but only 30% of females. That seems like a big difference, and it appears that something is going on.
(f) Repeat the previous part, but for each department separately. (The modification to your code is a very small one). Are your results from (e) and (f) consistent, or not?
The modification is to add Dept to the group_by. As to where: we want the proportions of males and females admitted within each department, so we add the department to the group_by first, thus:
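A sketch, reusing the names from above:

ar %>%
  group_by(Dept, gender) %>%
  summarize(admitted_total = sum(admitted),
            rejected_total = sum(rejected)) %>%
  mutate(prop_admitted = admitted_total / (admitted_total + rejected_total))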
`summarise()` has grouped output by 'Dept'. You can override using the
`.groups` argument.
Remember that, overall, males were quite a bit more likely to be admitted than females, and we were worried about discrimination. But here, in four out of the six departments, females were more likely to be admitted than males, and in the other two departments, males had only a small edge over females. This seems entirely inconsistent with the overall picture: how can males come out ahead when everything is combined, but females come out ahead department by department? How is this even possible?
This is actually Simpson’s Paradox: by aggregating (summing up over all departments), males come out better, but by looking at departments separately, the females come out better. It seems as if it should be impossible.
There are actually two parts to the explanation:
First, the above shows that the departments differ considerably in how difficult it is to get admitted. Department A accepts most people that apply there, but Department F accepts almost no-one (let me guess: I bet it’s the medical school). So you should be feeling uneasy about combining these very different departments.
Second, let’s work out how many total applications there were by males and females to each department:
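A sketch consistent with the code note below (total is my name for the column of total applications):

ar %>%
  mutate(total = admitted + rejected) %>%
  select(Dept, gender, total) %>%
  pivot_wider(names_from = gender, values_from = total)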
(code note: the reason for the pivot-wider was to put males and females side by side for each department, and the reason for the select before that was to make the pivot-wider work properly.)
A lot more males than females applied to departments A and B, where most of the applicants were admitted. But more females than males applied to departments C and E, where it was harder to be admitted.
In other words, the reason that more males than females got admitted overall is because they tended to apply to departments where more people got admitted! Once you look at each department individually (rather than combining over the very different departments), the discrimination completely goes away.
Another statistical way of thinking about this is that an applicant’s chance of getting admitted depends (possibly) on whether they are male or female, but also on what department they are applying to, and you need to consider both of these to come to a justifiable conclusion.
2 House prices revisited
Earlier, we rearranged data on house prices from multiple columns into the one column of prices we needed to make a boxplot. In this question, we go the other way: starting from the dataframe used to make the boxplot, can we turn it into the dataframe that we read in from the file? The original “longer” data, or something very like it, is in http://ritsokiguess.site/datafiles/homes_long.csv. I suggest you try and see how far you can get (hints: pivot_wider, organizing rows) before looking at what I did.
Read in and display (some of) the data.
Rearrange the data to make it look something like what you read in in the earlier question (that is, with separate columns for 3-bedroom houses and 4-bedroom houses, each column containing prices for houses with that many bedrooms).
House prices revisited: my solutions
In an earlier question, we rearranged data on house prices from multiple columns into the one column of prices we needed to make a boxplot. In this question, we go the other way: starting from the dataframe used to make the boxplot, can we turn it into the dataframe that we read in from the file? The original “longer” data, or something very like it, is in http://ritsokiguess.site/datafiles/homes_long.csv. I suggest you try and see how far you can get (hints: pivot_wider, organizing rows) before looking at what I did.
Read in and display (some of) the data.
Solution
Let’s start with our “longer” dataframe in the file:
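A sketch of the reading-in, with the name asking_longer matching the display below:

asking_longer <- read_csv("http://ritsokiguess.site/datafiles/homes_long.csv")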
Rows: 37 Columns: 2
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (1): bedrooms
dbl (1): price
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
asking_longer
That’s about where we finished before, and that would be a suitable dataframe for making a boxplot or running a two-sample t-test. The two columns are called price and bedrooms.
(I changed things around a little; the bedrooms column is called something more memorable, and the values in it are text rather than numbers.)
Rearrange the data to make it look something like what you read in in the earlier question (that is, with separate columns for 3-bedroom houses and 4-bedroom houses, each column containing prices for houses with that many bedrooms).
Solution
So, turning it from longer to wider is pivot_wider. What about this, then?
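Namely, a bare pivot_wider, something like this sketch:

asking_longer %>%
  pivot_wider(names_from = bedrooms, values_from = price)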
Warning: Values from `price` are not uniquely identified; output will contain list-cols.
• Use `values_fn = list` to suppress this warning.
• Use `values_fn = {summary_fun}` to summarise duplicates.
• Use the following dplyr code to identify duplicates.
{data} %>%
dplyr::group_by(bedrooms) %>%
dplyr::summarise(n = dplyr::n(), .groups = "drop") %>%
dplyr::filter(n > 1L)
Well, that did… something. But it is instructive to work out what happened, so that we can take a shot at fixing it. It looks as if the prices of the 14 4-bedroom houses did indeed end up in beds4, and the prices of the 23 3-bedroom houses did indeed end up under beds3. So the columns are all right. But pivot_wider has gone wrong in that there are not enough rows. (According to the hint I just gave, “when pivot-wider goes wrong” is the place to look for ideas.)
The warning message tries to help us: “the values from price are not uniquely identified” means that pivot_wider cannot determine exactly where to put each price value. If we were not expecting this, the code at the bottom of the warning will help us diagnose. But see the next paragraph.
The problem is that there is nothing in asking_longer that says what row anything goes in, with the result that everything gets smooshed into one row, making two list-columns. So maybe the idea is to make a column of row numbers first, before we pivot wider. row_number() is handy for this:
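A sketch of that idea (row is the column name referred to later):

asking_longer %>%
  mutate(row = row_number()) %>%
  pivot_wider(names_from = bedrooms, values_from = price)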
Well, we have prices in separate cells now, but we seem to have gained some missing values. Where from? Well, each row in asking_longer is either the asking price of a 3-bed house or a 4-bed house but not both, so that each row of the result has the actual price that was in that row under the number of bedrooms it was for, and a missing for the other one. Worse, these are not the kind of missings that drop_na will help us with, since rows with missings also contain data that we want to keep.
The idea that turns out to work involves some more careful thinking about row numbers. The problem above is that we need separate row numbers for each number of bedrooms, so that the idea we just had is nearly right but not quite. We want to do something (create row numbers) separately for each number of bedrooms, and it turns out that the multi-talented group_by is the key:
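A sketch of the grouping-and-numbering step on its own:

asking_longer %>%
  group_by(bedrooms) %>%
  mutate(row = row_number())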
Now you see that the rows have been numbered within each number of bedrooms, so that the first row numbers are repeated for both numbers of bedrooms, and the last few only appear once (after we run out of 4-bedroom houses).
Now, the column called row is the only column not named in the pivot_wider, so it will be used to decide which row each asking price goes in:
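Adding the pivot_wider onto the end of the sketch:

asking_longer %>%
  group_by(bedrooms) %>%
  # row is the only column not named in the pivot_wider below,
  # so it decides which output row each price lands in
  mutate(row = row_number()) %>%
  pivot_wider(names_from = bedrooms, values_from = price)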
and now it works. The missings in beds4 are there because there were only 14 4-bedroom houses, and so there are no values to go in rows 15 and on of that column. And, as you’ll recall, each column of a dataframe has to have the same number of rows.
3 Tidying basketball data
The data in http://ritsokiguess.site/datafiles/bb.txt contain some recent basketball results. The data, all in one column, are, for each game, the date the game was played, the road team’s name and score (separated by an underscore), and the home team’s name and score (also separated by an underscore). Our ultimate aim is to create a dataframe with one row for each game, and five columns: the date the game was played, the name of the road team, the road team’s score, the name of the home team, and the home team’s score. In this question, we build a pipeline to accomplish this, each part (except where stated) building on the part before.
(a) Read the data into a one-column dataframe (using, for example, read_csv), without using the first line as column names, and take a look at some of what you read in.
(b) We need to set things up to use pivot_wider. In particular, we need to do two things: say which column each entry goes to, and which row it goes to. To set this up, use c to make a vector called my_cols containing the entries date, road and home, and display it. (This does not use (a).)
(c) The function rep repeats things. It has up to four inputs (you normally use only the first one and one of the other three): the thing to be repeated, the number of times to repeat it (called times), the length of the final result (called length.out), and the number of times to repeat each element of the input (called each). Using your vector my_cols, experiment with each one of the three optional inputs to see what they do. Also, experiment with specifying both times and each.
(d) Starting from the dataframe read in from the file, create a new column what that uses rep on your vector my_cols to repeat the latter enough times to be the same length as X1. Check that the dates, road teams and home teams match up. (This uses the results from (a) and (c).)
(e) Run pivot_wider to create three new columns called date, road and home. What has happened? Why?
(f) Use rep again to create another column that is the numbers 1, 2, 3,… each repeated three times. This will identify each game.
(g) Does pivot_wider work now?
(h) Use the appropriate kind of separate to create columns of road and home teams and scores.
Tidying basketball data: my solutions
The data in http://ritsokiguess.site/datafiles/bb.txt contain some recent basketball results. The data, all in one column, are, for each game, the date the game was played, the road team’s name and score (separated by an underscore), and the home team’s name and score (also separated by an underscore). Our ultimate aim is to create a dataframe with one row for each game, and five columns: the date the game was played, the name of the road team, the road team’s score, the name of the home team, and the home team’s score. In this question, we build a pipeline to accomplish this, each part (except where stated) building on the part before.
(a) Read the data into a one-column dataframe (using, for example, read_csv), without using the first line as column names, and take a look at some of what you read in.
There is only one column, so you can pretend that the “columns” are separated by anything, like a comma. To not read the first row as the name of the first column, set col_names to be FALSE:
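A sketch, with the name nba matching the display below:

nba <- read_csv("http://ritsokiguess.site/datafiles/bb.txt", col_names = FALSE)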
Rows: 48 Columns: 1
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (1): X1
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
nba
The one column gets the name X1. There are 48 rows, which means there are this many games:
48/3
[1] 16
and you can see that the entries in X1 are a date, a team and its score, another team and its score, repeated.
(b) We need to set things up to use pivot_wider. In particular, we need to do two