Ken's Blog: Another tidying problem

Ken Butler

Introduction

Some cars have a computer that records gas mileage since the last time the computer was reset. A driver is concerned that the computer on their car is not as accurate as it might be, so they keep an old-fashioned notebook and record the miles driven since the last fillup, and the amount of gas filled up, and use that to compute the miles per gallon. They also record what the car’s computer says the miles per gallon was.

Is there a systematic difference between the computer’s values and the driver’s? If so, which way does it go?

Packages

library(tidyverse)

The data

The driver’s notebook has small pages, so the data look like this:

Fillup     1    2    3    4    5
Computer 41.5 50.7 36.6 37.3 34.2
Driver   36.5 44.2 37.2 35.6 30.5
Fillup     6    7    8    9   10
Computer 45.0 48.0 43.2 47.7 42.2
Driver   40.5 40.0 41.0 42.8 39.2
Fillup    11   12   13   14   15
Computer 43.2 44.6 48.4 46.4 46.8
Driver   38.8 44.5 45.4 45.3 45.7
Fillup    16   17   18   19   20
Computer 39.2 37.3 43.5 44.3 43.3
Driver   34.2 35.2 39.8 44.9 47.5

This is not very close to tidy. There are three variables: the fillup number (identification), the computer’s miles-per-gallon value, and the driver’s. These should be in columns, not rows. Also, there are really four sets of rows, because of the way the data was recorded. How are we going to make this tidy?

Making it tidy

As ever, we take this one step at a time, building a pipeline as we go: we see what each step produces before figuring out what to do next.

The first thing is to read the data in; these are aligned columns, so read_table is the thing. Also, there are no column headers, so we have to say that as well:

my_url <- "https://raw.githubusercontent.com/nxskok/nxskok.github.io/master/gas-mileage.txt"
gas <- read_table(my_url, col_names = FALSE)
gas

# A tibble: 12 × 6
   X1          X2    X3    X4    X5    X6
   <chr>    <dbl> <dbl> <dbl> <dbl> <dbl>
 1 Fillup     1     2     3     4     5  
 2 Computer  41.5  50.7  36.6  37.3  34.2
 3 Driver    36.5  44.2  37.2  35.6  30.5
 4 Fillup     6     7     8     9    10  
 5 Computer  45    48    43.2  47.7  42.2
 6 Driver    40.5  40    41    42.8  39.2
 7 Fillup    11    12    13    14    15  
 8 Computer  43.2  44.6  48.4  46.4  46.8
 9 Driver    38.8  44.5  45.4  45.3  45.7
10 Fillup    16    17    18    19    20  
11 Computer  39.2  37.3  43.5  44.3  43.3
12 Driver    34.2  35.2  39.8  44.9  47.5

Longer first

I usually find it easier to make the dataframe longer first, and then figure out what to do next. Here, that means putting all the data values in one column, and having a column of variable names indicating what each variable is a value of, thus:

gas %>% pivot_longer(X2:X6, names_to = "var_name", values_to = "var_value")

# A tibble: 60 × 3
   X1       var_name var_value
   <chr>    <chr>        <dbl>
 1 Fillup   X2             1  
 2 Fillup   X3             2  
 3 Fillup   X4             3  
 4 Fillup   X5             4  
 5 Fillup   X6             5  
 6 Computer X2            41.5
 7 Computer X3            50.7
 8 Computer X4            36.6
 9 Computer X5            37.3
10 Computer X6            34.2
# … with 50 more rows

The things in X1 are our column-names-to-be, and the values that go with them are in var_value. var_name has mostly served its purpose; these are the original columns in the data file, which we don’t need any more. So now, we make this wider, right?

gas %>% pivot_longer(X2:X6, names_to = "var_name", values_to = "var_value") %>% 
  pivot_wider(names_from = X1, values_from = var_value)

# A tibble: 5 × 4
  var_name Fillup    Computer  Driver   
  <chr>    <list>    <list>    <list>   
1 X2       <dbl [4]> <dbl [4]> <dbl [4]>
2 X3       <dbl [4]> <dbl [4]> <dbl [4]>
3 X4       <dbl [4]> <dbl [4]> <dbl [4]>
4 X5       <dbl [4]> <dbl [4]> <dbl [4]>
5 X6       <dbl [4]> <dbl [4]> <dbl [4]>

Oh. How did we get list-columns?

The answer is that pivot_wider needs to know which column each var_value is going to, but also which row. The way it decides about rows is to look at all combinations of things in the other columns, the ones not involved in the pivot_wider. The only one of those here is var_name, so each value goes in the column according to its value in X1, and the row according to its value in var_name. For example, the value 41.5 in row 6 of the longer dataframe goes into the column labelled Computer and the row labelled X2. But if you scroll down the longer dataframe, you’ll find there are four data values with the Computer-X2 combination, so pivot_wider collects them together into one cell of the output dataframe.

This is what the warning is about.

spread handled this much less gracefully:

gas %>% pivot_longer(X2:X6, names_to = "var_name", values_to = "var_value") %>% 
  spread(X1, var_value)

Error: Each row of output must be identified by a unique combination of keys.
Keys are shared for 60 rows:
* 6, 21, 36, 51
* 7, 22, 37, 52
* 8, 23, 38, 53
* 9, 24, 39, 54
* 10, 25, 40, 55
* 11, 26, 41, 56
* 12, 27, 42, 57
* 13, 28, 43, 58
* 14, 29, 44, 59
* 15, 30, 45, 60
* 1, 16, 31, 46
* 2, 17, 32, 47
* 3, 18, 33, 48
* 4, 19, 34, 49
* 5, 20, 35, 50

It required a unique combination of values for the other variables in the dataframe, which in our case we have not got.

All right, back to this:

gas %>% pivot_longer(X2:X6, names_to = "var_name", values_to = "var_value") %>% 
  pivot_wider(names_from = X1, values_from = var_value)

# A tibble: 5 × 4
  var_name Fillup    Computer  Driver   
  <chr>    <list>    <list>    <list>   
1 X2       <dbl [4]> <dbl [4]> <dbl [4]>
2 X3       <dbl [4]> <dbl [4]> <dbl [4]>
3 X4       <dbl [4]> <dbl [4]> <dbl [4]>
4 X5       <dbl [4]> <dbl [4]> <dbl [4]>
5 X6       <dbl [4]> <dbl [4]> <dbl [4]>

There is a mindless way to go on from here, and a thoughtful way.

The mindless way to handle unwanted list-columns is to throw an unnest at the problem and see what happens:

gas %>% pivot_longer(X2:X6, names_to = "var_name", values_to = "var_value") %>% 
  pivot_wider(names_from = X1, values_from = var_value)  %>% 
  unnest()

Warning: Values are not uniquely identified; output will contain list-cols.
* Use `values_fn = list` to suppress this warning.
* Use `values_fn = length` to identify where the duplicates arise
* Use `values_fn = {summary_fun}` to summarise duplicates

Warning: `cols` is now required when using unnest().
Please use `cols = c(Fillup, Computer, Driver)`

# A tibble: 20 × 4
   var_name Fillup Computer Driver
   <chr>     <dbl>    <dbl>  <dbl>
 1 X2            1     41.5   36.5
 2 X2            6     45     40.5
 3 X2           11     43.2   38.8
 4 X2           16     39.2   34.2
 5 X3            2     50.7   44.2
 6 X3            7     48     40  
 7 X3           12     44.6   44.5
 8 X3           17     37.3   35.2
 9 X4            3     36.6   37.2
10 X4            8     43.2   41  
11 X4           13     48.4   45.4
12 X4           18     43.5   39.8
13 X5            4     37.3   35.6
14 X5            9     47.7   42.8
15 X5           14     46.4   45.3
16 X5           19     44.3   44.9
17 X6            5     34.2   30.5
18 X6           10     42.2   39.2
19 X6           15     46.8   45.7
20 X6           20     43.3   47.5

This has worked.¹ The fillup numbers have come out in the wrong order, but that’s probably not a problem. It would also work if you had a different number of observations on each row of the original data file, as long as you had a fillup number, a computer value and a driver value for each one.

The thoughtful way to go is to organize it so that each row will have a unique combination of columns that are left. A way to do that is to note that the original data file has four “blocks” of five observations each:

gas

# A tibble: 12 × 6
   X1          X2    X3    X4    X5    X6
   <chr>    <dbl> <dbl> <dbl> <dbl> <dbl>
 1 Fillup     1     2     3     4     5  
 2 Computer  41.5  50.7  36.6  37.3  34.2
 3 Driver    36.5  44.2  37.2  35.6  30.5
 4 Fillup     6     7     8     9    10  
 5 Computer  45    48    43.2  47.7  42.2
 6 Driver    40.5  40    41    42.8  39.2
 7 Fillup    11    12    13    14    15  
 8 Computer  43.2  44.6  48.4  46.4  46.8
 9 Driver    38.8  44.5  45.4  45.3  45.7
10 Fillup    16    17    18    19    20  
11 Computer  39.2  37.3  43.5  44.3  43.3
12 Driver    34.2  35.2  39.8  44.9  47.5

Each set of three rows is one block. So if we number the blocks, each observation of Fillup, Computer, and Driver will have an X-something column that it comes from and a block, and this combination will be unique.

You could create the block column by hand easily enough, or note that each block starts with a row called Fillup and use this idea:

gas %>% mutate(block = cumsum(X1=="Fillup"))

# A tibble: 12 × 7
   X1          X2    X3    X4    X5    X6 block
   <chr>    <dbl> <dbl> <dbl> <dbl> <dbl> <int>
 1 Fillup     1     2     3     4     5       1
 2 Computer  41.5  50.7  36.6  37.3  34.2     1
 3 Driver    36.5  44.2  37.2  35.6  30.5     1
 4 Fillup     6     7     8     9    10       2
 5 Computer  45    48    43.2  47.7  42.2     2
 6 Driver    40.5  40    41    42.8  39.2     2
 7 Fillup    11    12    13    14    15       3
 8 Computer  43.2  44.6  48.4  46.4  46.8     3
 9 Driver    38.8  44.5  45.4  45.3  45.7     3
10 Fillup    16    17    18    19    20       4
11 Computer  39.2  37.3  43.5  44.3  43.3     4
12 Driver    34.2  35.2  39.8  44.9  47.5     4

This works because X1=="Fillup" is either true or false. cumsum takes cumulative sums; that is, the sum of all the values in the column down to and including the one you’re looking at. It requires numeric input, though, so it turns the logical values into 1 for TRUE and 0 for FALSE and adds those up. (This is the same thing that as.numeric does.) The idea is that the value of block gets bumped by one every time you hit a Fillup line.

Then pivot-longer as before:

gas %>% mutate(block = cumsum(X1=="Fillup")) %>% 
  pivot_longer(X2:X6, names_to = "var_name", values_to = "var_value")

# A tibble: 60 × 4
   X1       block var_name var_value
   <chr>    <int> <chr>        <dbl>
 1 Fillup       1 X2             1  
 2 Fillup       1 X3             2  
 3 Fillup       1 X4             3  
 4 Fillup       1 X5             4  
 5 Fillup       1 X6             5  
 6 Computer     1 X2            41.5
 7 Computer     1 X3            50.7
 8 Computer     1 X4            36.6
 9 Computer     1 X5            37.3
10 Computer     1 X6            34.2
# … with 50 more rows

and now you can check that the var_name - block combinations are unique for each value in X1, so pivoting wider should work smoothly now:

(gas %>% mutate(block = cumsum(X1=="Fillup")) %>% 
  pivot_longer(X2:X6, names_to = "var_name", values_to = "var_value") %>% 
  pivot_wider(names_from = X1, values_from = var_value) -> gas1)

# A tibble: 20 × 5
   block var_name Fillup Computer Driver
   <int> <chr>     <dbl>    <dbl>  <dbl>
 1     1 X2            1     41.5   36.5
 2     1 X3            2     50.7   44.2
 3     1 X4            3     36.6   37.2
 4     1 X5            4     37.3   35.6
 5     1 X6            5     34.2   30.5
 6     2 X2            6     45     40.5
 7     2 X3            7     48     40  
 8     2 X4            8     43.2   41  
 9     2 X5            9     47.7   42.8
10     2 X6           10     42.2   39.2
11     3 X2           11     43.2   38.8
12     3 X3           12     44.6   44.5
13     3 X4           13     48.4   45.4
14     3 X5           14     46.4   45.3
15     3 X6           15     46.8   45.7
16     4 X2           16     39.2   34.2
17     4 X3           17     37.3   35.2
18     4 X4           18     43.5   39.8
19     4 X5           19     44.3   44.9
20     4 X6           20     43.3   47.5

and so it does.

Sometimes a pivot_longer followed by a pivot_wider can be turned into a single pivot_longer with options (see the pivoting vignette for examples), but this appears not to be one of those.

Comparing the driver and the computer

Now that we have tidy data, we can do an analysis. These are matched-pair data (one Computer and one Driver measurement), so a sensible graph would be of the differences, a histogram, say:

gas1 %>% mutate(diff = Computer - Driver) %>% 
  ggplot(aes(x=diff)) + geom_histogram(bins=6)

There is only one observation where the driver’s measurement is much bigger than the computer’s; otherwise, there is not much to choose or the computer’s measurement is bigger. Is this something that would generalize to “all measurements”, presumably all measurements at fillup by this driver and this computer? The differences are not badly non-normal, so a \(t\)-test should be fine:

with(gas1, t.test(Computer, Driver, paired = TRUE))


    Paired t-test

data:  Computer and Driver
t = 4.358, df = 19, p-value = 0.0003386
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 1.418847 4.041153
sample estimates:
mean of the differences 
                   2.73

It is. The computer’s mean measurement is estimated to be between about 1.4 and 4.0 miles per gallon larger than the driver’s.

References

I did get away with using unnest the old-fashioned way, though. What I should have done is given below the second warning.↩︎

Another tidying problem