Worksheet 8

Published

March 15, 2024

Questions are below. My solutions are below all the question parts for a question; scroll down if you get stuck. There might be extra discussion below that for some of the questions; you might find that interesting to read, maybe after tutorial.

For these worksheets, you will learn the most by spending a few minutes thinking about how you would answer each question before you look at my solution. There are no grades attached to these worksheets, so feel free to guess: it makes no difference at all how wrong your initial guess is!

One of the questions (the second one) is a repeat of the one on the end of Worksheet 7 (that you may not have gotten to then).

1 Pivoting

In this question, you are given some dataframes and a bit of code, and your job is to decide what the code does and what its output will be, before you run it and find out.

Starting from this dataframe:

d1 <- tribble(
  ~x, ~y, ~z,
  10, 11, 13,
  9, 12, 14
)
d1

what will happen if you run this code?

d1 %>% pivot_longer(y:z, names_to = "name", values_to = "value")

Starting from here:

d2 <- tribble(
  ~x, ~y_first, ~y_second, ~y_third,
  "a", 10, 11, 13,
  "b", 12, 15, 16
)
d2

this code is run. What is the output?

d2 %>% 
  pivot_longer(starts_with("y"), names_to = c(".value", "when"), names_sep = "_")

Starting from here:

d3 <- tribble(
  ~id, ~x, ~y,
  1, "a", 10,
  2, "a", 11,
  1, "b", 12,
  2, "b", 13
)
d3

what happens if you run this?

d3 %>% pivot_wider(names_from = x, values_from = y)

Starting from here:

d4 <- tribble(
  ~x, ~y, ~z,
  "a", "b", 9,
  "c", "d", 10,
  "a", "d", 11,
  "c", "b", 12
)
d4

what happens if you run this?

d4 %>% pivot_wider(names_from = y, values_from = z)

Pivoting: my solutions

In this question, you are given some dataframes and a bit of code, and your job is to decide what the code does and what its output will be, before you run it and find out.

Starting from this dataframe:

d1 <- tribble(
  ~x, ~y, ~z,
  10, 11, 13,
  9, 12, 14
)
d1

what will happen if you run this code?

d1 %>% pivot_longer(y:z, names_to = "name", values_to = "value")

Solution

In this question, the solution is to try it and see, and if it comes out differently to what you expected, try to work out why:

d1 %>% pivot_longer(y:z, names_to = "name", values_to = "value")

The two columns y and z are pivoted longer, so the column name contains the name of the column from which the value came, and the column value contains the value. The values in x are repeated as necessary to match up with the y and z they originally went with.

Here and elsewhere in this question, the order of the rows is not important, as long as the values match up. In this one, y in name goes with 11 and 12 in value; one of the y values goes with an x of 10 and the other with 9.

\(\blacksquare\)

Starting from here:

d2 <- tribble(
  ~x, ~y_first, ~y_second, ~y_third,
  "a", 10, 11, 13,
  "b", 12, 15, 16
)
d2

this code is run. What is the output?

d2 %>% 
  pivot_longer(starts_with("y"), names_to = c(".value", "when"), names_sep = "_")

Solution

d2

d2 %>% 
  pivot_longer(starts_with("y"), names_to = c(".value", "when"), names_sep = "_")

The .value says to create a column named y (the first bit is always y) filled with the values in the dataframe, and then to create a column called when filled with the second part of the names of the columns in d4. The column x not mentioned is repeated as needed to match up with the other values.

\(\blacksquare\)

Starting from here:

d3 <- tribble(
  ~id, ~x, ~y,
  1, "a", 10,
  2, "a", 11,
  1, "b", 12,
  2, "b", 13
)
d3

what happens if you run this?

d3 %>% pivot_wider(names_from = x, values_from = y)

Solution

This is the “easy” version of pivot_wider that is literally the flip side of pivoting longer:

d3

d3 %>% pivot_wider(names_from = x, values_from = y)

The values in x are a and b, so x and y are replaced by new columns called a and b; the values in these are the ones in y. The rows into which the values of y go are actually the ones in id, but you can (here) successfully guess that they are in the same order as the original data, because the values in id also are.

\(\blacksquare\)

Starting from here:

d4 <- tribble(
  ~x, ~y, ~z,
  "a", "b", 9,
  "c", "d", 10,
  "a", "d", 11,
  "c", "b", 12,
  "a", "e", 13
)
d4

what happens if you run this?

d4 %>% pivot_wider(names_from = y, values_from = z)

Solution

d4

d4 %>% pivot_wider(names_from = y, values_from = z)

To work out why that:

The names of the new columns are coming from y, so the new columns will be called b and d. The values to go into them will be the ones in z, so the 9 and 12 will go into the b column and the 10 and 11 will go into d. The remaining column, x, will determine which rows each numeric value will go into. Thus the 9 and 11 will go into the row where x is a, and the 10 and 12 will go into the row where x is b. Finally, the last row of d4 is the only one where y is e, so that value will go into the a row (since x is a) and the value 13 will go into the e column. There is no value to go into the e column and c row, so the value there in the result is missing. Thus the result will be as shown.

An answer with rows or columns in a different order is fine, since the point of the question was to see whether you know what pivot_wider actually does (so that the values are in the right places relative to each other), rather than the nitty-gritty details of how things are ordered.

Extra: this is an easy kind of question to give you on an exam, when you would have to reason out what will happen to a dataframe when a pivot_longer or pivot_wider is run on it, or you might be given the input and output dataframes, and your job is then to say what code will get you from the input to the output.

\(\blacksquare\)

2 Home prices

A realtor kept track of the asking prices of 37 homes for sale in West Lafayette, Indiana, in a particular year. Unfortunately for us, the data are as shown in http://ritsokiguess.site/datafiles/homes.txt. The small numbers, 3 and 4, on lines by themselves, are the number of bedrooms of the homes whose asking prices follow that number. The actual asking prices, in dollars, are aligned in columns, up to six of them in a line.

We need to get these data into a tidy format, with one column being the number of bedrooms and the other being the asking price of the home.

The question has three parts, but you will spend most of your time and thinking on part (b). Part (c) is straightforward once you have done part (b).

Read in the (untidy) data from the file, in a way that you get all the data values. To make sure that R reads all six columns, set up a vector of six column names (it does not matter what they are) and tell the appropriate read_ function to use these as the column names. (You will have to look this last thing up. Say where you found the answer.)
Obtain a tidy dataframe, according to the specifications in the question. Describe your process clearly enough that anyone reading your answer would be able to do it themselves and see why it works.

Hints: you might find it useful to find out about the following, if you don’t already know about them:

drop_na
ifelse
fill

Also think about how you know whether a number in the data file is a selling price or a number of bedrooms.

Use your tidy dataframe to make a boxplot of these data.

Home prices: my solutions

We need to get these data into a tidy format, with one column being the number of bedrooms and the other being the asking price of the home.

The question has three parts, but you will spend most of your time and thinking on part (b). Part (c) is straightforward once you have done part (b).

Read in the (untidy) data from the file, in a way that you get all the data values. To make sure that R reads all six columns, set up a vector of six column names (it does not matter what they are) and tell the appropriate read_ function to use these as the column names. (You will have to look this last thing up. Say where you found the answer.)

Solution

These are aligned columns,¹ so read_table is what you need. Set up the column names first, any six distinct pieces of text. The way you tell read_table what to use as the column names (when you don’t want to use the first line of the data file) is col_names. If you read the help for read_table (by typing ?read_table in the console, or the same thing online), you’ll see that there is an optional input col_names whose default value is TRUE,² but which can be changed to FALSE (don’t read any column names at all, using X1, X2 etc.) or a collection of column names that you specify, which is what we want here. I thought this was in my lecture notes, but it appears not to be, not at least until much later.

my_cols <- c("a", "b", "c", "d", "e", "f")
my_url <- "http://ritsokiguess.site/datafiles/homes.txt"
asking0 <- read_table(my_url, col_names = my_cols)


── Column specification ────────────────────────────────────────────────────────
cols(
  a = col_double(),
  b = col_double(),
  c = col_double(),
  d = col_double(),
  e = col_double(),
  f = col_double()
)

Warning: 4 parsing failures.
row col  expected    actual                                           file
  1  -- 6 columns 1 columns 'http://ritsokiguess.site/datafiles/homes.txt'
  4  -- 6 columns 2 columns 'http://ritsokiguess.site/datafiles/homes.txt'
  5  -- 6 columns 1 columns 'http://ritsokiguess.site/datafiles/homes.txt'
  9  -- 6 columns 5 columns 'http://ritsokiguess.site/datafiles/homes.txt'

asking0

It looks as if we have everything that was in the data file, along with a whole bunch of missings (that we will get rid of later). I have given this a temporary name, saving the good name for the eventual tidy dataframe. The rather scary-sounding “parsing failure” is because on four of the lines, there are fewer than the six values it was expecting (because we gave six column names); read_table supplied missing values for the values it was expecting and didn’t see, which is just fine by us.

If you don’t set the column names first, you’ll get this:

read_table(my_url)


── Column specification ────────────────────────────────────────────────────────
cols(
  `4` = col_double()
)

Warning: 7 parsing failures.
row col  expected    actual                                           file
  1  -- 1 columns 6 columns 'http://ritsokiguess.site/datafiles/homes.txt'
  2  -- 1 columns 6 columns 'http://ritsokiguess.site/datafiles/homes.txt'
  3  -- 1 columns 2 columns 'http://ritsokiguess.site/datafiles/homes.txt'
  5  -- 1 columns 6 columns 'http://ritsokiguess.site/datafiles/homes.txt'
  6  -- 1 columns 6 columns 'http://ritsokiguess.site/datafiles/homes.txt'
... ... ......... ......... ..............................................
See problems(...) for more details.

The problem is that the first row only has one data value on it (the 4 of 4 bedrooms), and so read_table will assume that there is only one data value in each row all the way through, which is not true. This way is actually wrong two ways: the column name 4, that was read from the first line of the data file, is actually data and not a column name.

Setting col_names = FALSE also gets you only one column instead of six, for the same reason.

The only way to tell read_table that there are actually six columns is to specify six column names. Well, not quite the only way. One of the other options in read_table is skip; if you skip the first line of the data file you will at least get six columns:

read_table(my_url, skip = 1, col_names = FALSE)


── Column specification ────────────────────────────────────────────────────────
cols(
  X1 = col_double(),
  X2 = col_double(),
  X3 = col_double(),
  X4 = col_double(),
  X5 = col_double(),
  X6 = col_double()
)

Warning: 3 parsing failures.
row col  expected    actual                                           file
  3  -- 6 columns 2 columns 'http://ritsokiguess.site/datafiles/homes.txt'
  4  -- 6 columns 1 columns 'http://ritsokiguess.site/datafiles/homes.txt'
  8  -- 6 columns 5 columns 'http://ritsokiguess.site/datafiles/homes.txt'

This at least gets all the house prices, but it does not get any indication that the first three rows belong to houses with four bedrooms.

\(\blacksquare\)

Obtain a tidy dataframe, according to the specifications in the question. Describe your process clearly enough that anyone reading your answer would be able to do it themselves and see why it works.

Hints: you might find it useful to find out about the following, if you don’t already know about them:

drop_na
ifelse
fill

Also think about how you know whether a number in the data file is a selling price or a number of bedrooms.

Solution

For the first step, we want to get all the prices in one column. This will unfortunately put the numbers of bedrooms in the same column, but we will separate those out shortly. Another way to say that is that the current six columns have no bearing on how the data should eventually end up: it makes no difference whether a data value is currently in column 1 (a) or 2 (b), for example. So, as is often the case, the first step is a pivot_longer:

asking0 %>% pivot_longer(everything(), 
                        names_to = "old_column", values_to = "price")

All of the prices, along with the numbers of bedrooms and a substantial number of missings, are in the column I called price.³

I decided to get rid of the missing values next; you could do it now or later:

asking0 %>% pivot_longer(everything(), 
                        names_to = "old_column", values_to = "price") %>% 
  drop_na(price)

We are down to 39 rows, which is actually right because two of those supposed prices are actually numbers of bedrooms.

You can also do these two steps in one, by adding an option to pivot_longer:

asking0 %>% pivot_longer(everything(), 
                        names_to = "old_column", values_to = "price", values_drop_na = TRUE)

This is perhaps better. I used this idea in lecture (in the tuberculosis example).

So now we need to make a column that is going to contain the number of bedrooms. How do we tell that we are looking at a number of bedrooms? If it’s a small number, say less than 10, it’s a number of bedrooms, and if it’s not, it’s an asking price, and we don’t know (yet) how many bedrooms it goes with.

So my idea is to create a new column called, say, bedrooms, that is the number of bedrooms if that’s what we’re looking at, and missing otherwise.⁴ This is what ifelse⁵ does:

asking0 %>% pivot_longer(everything(), 
                        names_to = "old_column", values_to = "price") %>% 
  drop_na(price) %>% 
  mutate(bedrooms = ifelse(price < 10, price, NA))

In the description of the data, the asking prices that came after 4 were for houses with 4 bedrooms (149900, 169900, …, 269900). The other ones, after the 3, were for houses with 3 bedrooms. That means that the bunch of missing bedrooms after the 4 should be filled in with 4, and the ones after 3 should be filled in with 3. This is exactly how fill works. (I should probably admit that I suspected that this was coming, so I set it up this way with how I did the ifelse. If you found out about fill first, you might also have had the idea that having the other values in the bedrooms column be missing would be a good idea, with your thought process being something like “how can I set this up so that fill will work?”.) Thus:

asking0 %>% pivot_longer(everything(), 
                        names_to = "old_column", values_to = "price") %>% 
  drop_na(price) %>% 
  mutate(bedrooms = ifelse(price < 10, price, NA)) %>% 
  fill(bedrooms)

fill is a very helpful tool. Here, you really need something that looks upwards to see what the previous number of bedrooms was, which is not a tidyverse-friendly kind of thing to do. The only other way I can think of is something like this, which happens to work here:

asking0 %>% pivot_longer(everything(), 
                        names_to = "old_column", values_to = "price") %>% 
  drop_na(price) %>% 
  mutate(junk = ifelse(price < 10, price, 10)) %>%
  mutate(bedrooms = cummin(junk))

cummin is “cumulative minimum”, that is, the minimum seen so far, counting down from the top. This worked here because the four-bedroom houses were listed before the three-bedroom ones, and in my ifelse the value used in junk if the thing in price was not a number of bedrooms was something bigger than either 4 or 3. If the three-bedroom house prices had been listed first, instead of using 10 in junk I would have had to use something smaller like 0, and then use cummax instead of cummin. You can see that the fill idea is a lot more elegant and involves a lot less checking of things.

Nearly there now. Two more steps:

get rid of the rows that have a number of bedrooms rather than an asking price in them (which you can detect because they have a number less than 10 in them for price).
get rid of the column that has the old column names in it (the ones we supplied when we read in the file). It turns out that the only thing we needed them for was to read in the file properly; they served no other purpose:

asking0 %>% pivot_longer(everything(), 
                        names_to = "old_column", values_to = "price") %>% 
  drop_na(price) %>% 
  mutate(bedrooms = ifelse(price < 10, price, NA)) %>% 
  fill(bedrooms) %>% 
  filter(price>10) %>%
  select(-old_column) -> asking
asking

Success! 37 rows for the 37 homes, and the right two columns.

This is basically what I did to create the data file of these data for you (the one you used when you encountered this data earlier, to do some kind of two-sample test).

For you, at a minimum you need the code in my final pipeline just above, plus an explanation of what each line in it does and the logic behind your process. It’s up to you whether you do an explanation of the whole thing, or whether you run it up to a certain point and explain what it has done so far (which is the way I did it).

I’m expecting that this part will take you quite a bit of time, and may involve several false trails (ideas you thought might work but turned out not to). The process of tidying a new data set is more like solving a puzzle than anything else. (At least in this case you know that there is an answer in the end, so this one really is like a puzzle.)

\(\blacksquare\)

Use your tidy dataframe to make a boxplot of these data.

Solution

Now you get to reap the rewards of your hard work: something that’s now easy, or at least easy-ish. The number of bedrooms is the nearest thing we have to a categorical variable, except that it’s actually a number, so we have to make it categorical by passing it into factor:

ggplot(asking, aes(x = factor(bedrooms), y = price)) +
  geom_boxplot()

The asking prices for 4-bedroom homes are on average clearly bigger than for 3-bedroom homes.

as.character also works instead of factor, either above or below. If you are bothered about the funny label on the \(x\)-axis, redefine bedrooms from a number to text/factor before you draw the boxplot:

asking %>% 
  mutate(bedrooms = as.character(bedrooms)) %>% 
  ggplot(aes(x = bedrooms, y = price)) +
    geom_boxplot()

and now the \(x\)-axis has a nice label.

Extra: I think this was the one where, before, I had you take logs of the asking prices first, which made the boxes look more symmetric so that a \(t\)-test was then reasonable to do. These distributions are very much skewed to the right:

asking %>% 
  mutate(bedrooms = as.character(bedrooms)) %>% 
  ggplot(aes(sample = price)) +
    stat_qq() + stat_qq_line() + facet_wrap(~bedrooms)

The distribution of asking prices for 3-bedroom houses is definitely skewed to the right (lower values too bunched up, higher values too spread out). The distribution of asking prices for 4-bedroom houses is surprisingly close to normal; all that is non-normal about it is that the lower tail is too short. The boxplot for these houses suggested a more skewed distribution than this; in fact, another explanation for the boxplot you got is that the lower tail is short compared to the normal. This is a possible explanation for this type of distribution; not a common one, but a possible one. It goes to show that sometimes you need to look at the normal quantile plot to see what is really going on.

\(\blacksquare\)

Footnotes

At least the dataset is not that untidy!↩︎
Meaning, if you don’t specify it, the values from the first line of the file are used as column names.↩︎
There were actually 37 homes altogether, which gives you an idea of just how many missings we have. These are not actual missing data; they came from the blanks in the datafile, which indicate where there might have been data but there was actually none.↩︎
I am setting up for something later; you might have to do a bit of trial and error here.↩︎
Or, should you prefer, case_when.↩︎