Ken's Blog: Two header rows and other spreadsheets

Ken Butler

Packages

library(tidyverse)

Introduction

A friend tells you that they are trying to find out which combination of detergent and temperature gets the most dirt off their laundry. They send you a spreadsheet that looks like this:

The first row is the name of the detergent (only named once), and the second row is the washing temperature. Below that is the amount of dirt removed from each of four loads of laundry washed under those conditions. (You know that your friend knows something about statistics and would have been careful to randomize loads of laundry to treatments.)

This is not going to be very helpful to you because it has two header rows. Fortunately Alison Hill has a blog post on almost this thing, which we can steal. In hers, the first row was variable names and the second was variable descriptions (which she used to make a data dictionary). Here, though, the column names need to be made out of bits of both rows.

Making column names

The strategy is the same as Alison used (so I’m claiming very little originality here): read the header lines and make column names out of them, then read the rest of the data with the column names that we made.

Your friend supplied you with a .csv file (they do have some training, after all):

my_file <- "https://raw.githubusercontent.com/nxskok/nxskok.github.io/master/static/detergent.csv"
headers <- read_csv(my_file, col_names=F, n_max=2)
headers

# A tibble: 2 × 6
  X1    X2    X3    X4    X5    X6   
  <chr> <chr> <chr> <chr> <chr> <chr>
1 Super <NA>  <NA>  Best  <NA>  <NA> 
2 Cold  Warm  Hot   Cold  Warm  Hot

Couple of things here: we want read_csv to supply some dummy column names, and we want to read only two rows.

To use this, we want to construct some column names, but to do this it will be much easier if we have six rows and a few columns. For me, this is an everything-looks-like-a-nail moment, and I reach for gather, and then stop myself just in time to use pivot_longer instead. To keep things straight, I’m going to make a new column first so that I know what is what, and then use the default column names name and value in pivot_longer until I figure out what I’m doing:

headers %>% mutate(what=c("detergent", "temperature")) %>% 
  pivot_longer(-what)

# A tibble: 12 × 3
   what        name  value
   <chr>       <chr> <chr>
 1 detergent   X1    Super
 2 detergent   X2    <NA> 
 3 detergent   X3    <NA> 
 4 detergent   X4    Best 
 5 detergent   X5    <NA> 
 6 detergent   X6    <NA> 
 7 temperature X1    Cold 
 8 temperature X2    Warm 
 9 temperature X3    Hot  
10 temperature X4    Cold 
11 temperature X5    Warm 
12 temperature X6    Hot

So now it looks as if I want to pivot_wider that column what, getting the values from value. (At this point, I feel a nagging doubt that I can do this with one pivot_longer, but anyway):¹

headers %>% mutate(what=c("detergent", "temperature")) %>% 
  pivot_longer(-what) %>% 
  pivot_wider(names_from=what, values_from=value) -> d1
d1

# A tibble: 6 × 3
  name  detergent temperature
  <chr> <chr>     <chr>      
1 X1    Super     Cold       
2 X2    <NA>      Warm       
3 X3    <NA>      Hot        
4 X4    Best      Cold       
5 X5    <NA>      Warm       
6 X6    <NA>      Hot

Much better. Next, I need to fill those missing values in detergent, and then I glue those two things together to make my column names:

d1 %>% fill(detergent) %>% 
  mutate(mycol=str_c(detergent, temperature, sep="_")) -> d2
d2

# A tibble: 6 × 4
  name  detergent temperature mycol     
  <chr> <chr>     <chr>       <chr>     
1 X1    Super     Cold        Super_Cold
2 X2    Super     Warm        Super_Warm
3 X3    Super     Hot         Super_Hot 
4 X4    Best      Cold        Best_Cold 
5 X5    Best      Warm        Best_Warm 
6 X6    Best      Hot         Best_Hot

and then grab my desired column names as a vector:

d2 %>% pull(mycol) -> my_col_names

Constructing the data frame with the rest of the data

Now we need to read the actual data, which means skipping the first two rows, and while doing so, use the column names we made as column names for the data frame (Alison’s idea again):

laundry <- read_csv(my_file, skip=2, col_names=my_col_names)
laundry

# A tibble: 4 × 6
  Super_Cold Super_Warm Super_Hot Best_Cold Best_Warm Best_Hot
       <dbl>      <dbl>     <dbl>     <dbl>     <dbl>    <dbl>
1          4          7        10         6        13       12
2          5          9        12         6        15       13
3          6          8        11         4        12       10
4          5         12         9         4        12       13

Looking good so far.

We need to make this longer to do anything useful with it. Each column name encodes two things: a detergent name and a temperature, and this can be made longer in one shot by using two things in names_to in pivot_longer. This means I also have to say what those two names are separated by (which I forgot the first time, but the error message was helpful):

laundry %>% 
  pivot_longer(everything(), names_to=c("detergent", "temperature"), 
               names_sep="_", 
               values_to="dirt_removed") -> laundry_tidy
laundry_tidy

# A tibble: 24 × 3
   detergent temperature dirt_removed
   <chr>     <chr>              <dbl>
 1 Super     Cold                   4
 2 Super     Warm                   7
 3 Super     Hot                   10
 4 Best      Cold                   6
 5 Best      Warm                  13
 6 Best      Hot                   12
 7 Super     Cold                   5
 8 Super     Warm                   9
 9 Super     Hot                   12
10 Best      Cold                   6
# … with 14 more rows

Success.

A plot

There are four observations per combination of detergent and temperature, so that devotees of ANOVA among you will know that we can test for a significant interaction effect between detergent and temperature on the amount of dirt removed. (That is to say, the effect of temperature on dirt removed might be different for each detergent, and we have enough data to see whether that is indeed the case “for all laundry loads”.)

To see whether this is likely, we can make an interaction plot: plot the mean dirt removed for each temperature, separately for each detergent, and then join the results for each temperature by lines (coloured by detergent). This can be done by first making a data frame of means using group_by and summarize, or like this:

ggplot(laundry_tidy, aes(x=fct_inorder(temperature), y=dirt_removed, colour=detergent, group=detergent)) + 
  stat_summary(fun.y=mean, geom="point") +
  stat_summary(fun.y=mean, geom="line")

Code-wise, the last two lines are a kind of funky geom_point and geom_line, except that instead of plotting the actual amounts of dirt removed, we plot the mean dirt removed each time. (The fct_inorder plots the temperatures in the sensible order that they appear in the data, rather than alphabetical order.)

Statistically, if the two traces are more or less parallel, the two factors detergent and temperature act independently on the amount of dirt removed. But that is not the case here: a warm temperature is the best for Best detergent, while a hot temperature is best for Super detergent.²

As in actual website

So I lied to you (for the purpose of telling a story, but I hope a useful one).

Here’s how the data were actually laid out:

Detergent    Cold         Warm          Hot
Super     4,5,6,5     7,9,8,12   10,12,11,9
Best      6,6,4,4  13,15,12,12  12,13,10,13

Let’s see whether we can tell a different story by getting these data tidy. (I added the word Detergent to the top left cell to make our lives slightly easier.)

First, this is column-aligned data, so we need read_table:

my_file="https://raw.githubusercontent.com/nxskok/nxskok.github.io/master/static/laundry.txt"
laundry_2 <- read_table(my_file, col_types=cols(
  Cold=col_character(),
  Warm=col_character(),
  Hot=col_character()
))
laundry_2

# A tibble: 2 × 4
  Detergent Cold    Warm        Hot        
  <chr>     <chr>   <chr>       <chr>      
1 Super     4,5,6,5 7,9,8,12    10,12,11,9 
2 Best      6,6,4,4 13,15,12,12 12,13,10,13

My first go at this turned out to treat the comma as a thousands separator (which was then dropped), so the top left cell got read as the number 4565. This use of col_types forces the columns to be text, so they get left alone.

So now, a standard pivot_longer to begin:

laundry_2 %>% pivot_longer(-Detergent, names_to="Temperature", values_to="Dirt_removed")

# A tibble: 6 × 3
  Detergent Temperature Dirt_removed
  <chr>     <chr>       <chr>       
1 Super     Cold        4,5,6,5     
2 Super     Warm        7,9,8,12    
3 Super     Hot         10,12,11,9  
4 Best      Cold        6,6,4,4     
5 Best      Warm        13,15,12,12 
6 Best      Hot         12,13,10,13

We have several values for dirt removed, separated by commas. We could use separate to create four new columns and pivot them longer as well. But there is a better way:

laundry_2 %>% pivot_longer(-Detergent, names_to="Temperature", values_to="Dirt_removed") %>% 
  separate_rows(Dirt_removed, convert=T)

# A tibble: 24 × 3
   Detergent Temperature Dirt_removed
   <chr>     <chr>              <int>
 1 Super     Cold                   4
 2 Super     Cold                   5
 3 Super     Cold                   6
 4 Super     Cold                   5
 5 Super     Warm                   7
 6 Super     Warm                   9
 7 Super     Warm                   8
 8 Super     Warm                  12
 9 Super     Hot                   10
10 Super     Hot                   12
# … with 14 more rows

This brings us back to where we were before. A couple of notes about separate_rows:

it puts each separated value on a new row, and so is a combined separate and pivot_longer.
the default separator between values is everything non-alphanumeric except for a dot. That includes a comma, so we don’t have to do anything special.
convert=T says to turn the separated values into whatever they look like (here numbers).

From here, we can proceed as before with plots, ANOVA or whatever.

References

I don’t actually think I can here. I was thinking of .value, but that is used when the names of the columns that I’m making longer contain the names of new columns in them.↩︎
There are always two ways to express an interaction effect. The other one here is that the two detergents are pretty similar except at warm water temperatures, where Best is a lot better.↩︎

Two header rows and other spreadsheets