Tidying data

Tidying data

  • Data rarely come to us as we want to use them.
  • Before we can do analysis, typically have organizing to do.
  • This is typical of ANOVA-type data, “wide format”:
     pig feed1 feed2 feed3 feed4
       1  60.8  68.7  92.6  87.9
       2  57.0  67.7  92.1  84.2
       3  65.0  74.0  90.2  83.1
       4  58.6  66.3  96.5  85.7
       5  61.7  69.8  99.1  90.3
  • 20 pigs randomly allocated to one of four feeds. At end of study, weight of each pig is recorded.
  • Are any differences in mean weights among the feeds?
  • Problem: want all weights in one column, with 2nd column labelling which feed. Untidy!

Tidy and untidy data (Wickham)

  • Data set easier to deal with if:
    • each observation is one row
    • each variable is one column
    • each type of observation unit is one table
  • Data arranged this way called “tidy”; otherwise called “untidy”.
  • For the pig data:
    • response variable is weight, but scattered over 4 columns, which are levels of a factor feed.
    • Want all the weights in one column, with a second column feed saying which feed that weight goes with.
    • Then we can run aov.

Packages for this section

library(tidyverse)

Reading in the pig data

my_url <- "http://ritsokiguess.site/datafiles/pigs1.txt"
pigs1 <- read_delim(my_url, " ")
pigs1

Making it longer

  • We wanted all the weights in one column, labelled by which feed they went with.
  • This is a very common reorganization, and the magic “verb” is pivot_longer:
pigs1 %>% pivot_longer(feed1:feed4, names_to="feed", 
                       values_to="weight") -> pigs2
pigs2

Alternatives

Any way of choosing the columns to pivot longer is good, eg:

pigs1 %>% pivot_longer(-pig, names_to="feed", 
                       values_to="weight") 

or

pigs1 %>% pivot_longer(starts_with("feed"), names_to="feed", 
                       values_to="weight") 

Comments

  • pigs2 now in “long” format, ready for analysis.
  • Anatomy of pivot_longer:
    • columns to combine
    • a name for column that will contain groups (“names”)
    • a name for column that will contain measurements (“values”)

Identifying the pigs

  • Values in pig identify pigs within each group: pig 1 is four different pigs!
  • Create unique pig IDs by gluing pig number onto feed:
pigs2 %>% mutate(pig_id=str_c(feed, "_", pig)) -> pigs2
pigs2

…and finally, the analysis

  • which is just what we saw before:
weight.1 <- aov(weight ~ feed, data = pigs2)
summary(weight.1)
            Df Sum Sq Mean Sq F value   Pr(>F)    
feed         3   3521  1173.5   119.1 3.72e-11 ***
Residuals   16    158     9.8                     
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
  • The mean weights of pigs on the different feeds are definitely not all equal.
  • So we run Tukey to see which ones differ (over).

Tukey

TukeyHSD(weight.1)
  Tukey multiple comparisons of means
    95% family-wise confidence level

Fit: aov(formula = weight ~ feed, data = pigs2)

$feed
             diff        lwr       upr     p adj
feed2-feed1  8.68   3.001038 14.358962 0.0024000
feed3-feed1 33.48  27.801038 39.158962 0.0000000
feed4-feed1 25.62  19.941038 31.298962 0.0000000
feed3-feed2 24.80  19.121038 30.478962 0.0000000
feed4-feed2 16.94  11.261038 22.618962 0.0000013
feed4-feed3 -7.86 -13.538962 -2.181038 0.0055599

All of the feeds differ!

Mean weights by feed

To find the best and worst, get mean weight by feed group. I borrowed an idea from earlier to put the means in descending order:

pigs2 %>%
  group_by(feed) %>%
  summarize(mean_weight = mean(weight))%>%
  arrange(desc(mean_weight))

Feed 3 is best, feed 1 worst.

Should we have any concerns about the ANOVA?

ggplot(pigs2, aes(x = feed, y = weight)) + geom_boxplot()

Comments

  • Feed 2 has an outlier
  • But there are only 5 pigs in each group
  • The conclusion is so clear that I am OK with this.

Tuberculosis

  • The World Health Organization keeps track of number of cases of various diseases, eg. tuberculosis.
  • Some data:
my_url <- "http://ritsokiguess.site/datafiles/tb.csv"
tb <- read_csv(my_url)

The data (randomly chosen rows)

tb %>% slice_sample(n = 10)

Many rows:

nrow(tb)
[1] 5769

What we have

  • Variables: country (abbreviated), year. Then number of cases for each gender and age group, eg. m1524 is males aged 15–24. Also mu and fu, where age is unknown.
  • Lots of missings. Want to get rid of.
  • Abbreviations here.
tb %>% 
  pivot_longer(m04:fu, names_to = "genage", 
               values_to = "freq", values_drop_na = TRUE) -> tb2
  • Code for pivot_longer:
    • columns to make longer
    • column to contain the names (categorical)
    • column to contain the values (quantitative)
    • drop missings in the values

Results (some)

tb2

Separating

  • 4 columns, but 5 variables, since genage contains both gender and age group. Split that up using separate.
  • separate needs 3 things:
    • what to separate (no quotes needed),
    • what to separate into (here you do need quotes),
    • how to split.
  • For “how to split”, here “after first character”:
tb2 %>% separate_wider_position(genage, 
                                widths = c("gender" = 1, "age" = 4), 
                                too_few = "align_start") -> tb3
tb3

Tidied tuberculosis data (some)

tb3

In practice…

  • instead of doing the pipe one step at a time, you debug it one step at a time, and when you have each step working, you use that step’s output as input to the next step, thus:
tb %>%
  pivot_longer(m04:fu, names_to = "genage", 
               values_to = "freq", values_drop_na = TRUE) %>% 
  separate_wider_position(genage, 
                          widths = c("gender" = 1, "age" = 4), 
                          too_few = "align_start")
  • You can split the R code over as many lines as you like, as long as each line is incomplete, so that R knows more is to come.
  • I like to put the pipe symbol on the end of the line.

Total tuberculosis cases by year (some of the years)

tb3 %>%
  filter(between(year, 1991, 1998)) %>% 
  group_by(year) %>% summarize(total=sum(freq))
  • Something very interesting happened between 1994 and 1995.

To find out what

  • try counting up total cases by country:
tb3 %>% group_by(iso2) %>% 
  summarize(total=sum(freq)) %>% 
  arrange(desc(total))

What years do I have for China?

China started recording in 1995, which is at least part of the problem:

tb3 %>% filter(iso2=="CN") %>% 
  group_by(year) %>% 
  summarize(total=sum(freq))

First year of recording by country?

  • A lot of countries started recording in about 1995, in fact:
tb3 %>% group_by(iso2) %>% 
  summarize(first_year=min(year)) %>% 
  count(first_year)
  • So the reason for the big jump in cases is that so many countries started recording then, not that there really were more cases.

Some Toronto weather data

my_url <- 
  "http://ritsokiguess.site/STAC32/toronto_weather.csv"
weather <- read_csv(my_url)
weather

The columns

  • Daily weather records for “Toronto City” weather station in 2018:

    • station: identifier for this weather station (always same here)
    • Year, Month
    • element: whether temperature given was daily max or daily min
    • d01, d02,… d31: day of the month from 1st to 31st.

Off we go

Numbers in data frame all temperatures (for different days of the month), so first step is

weather %>% 
  pivot_longer(d01:d31, names_to="day", 
               values_to="temperature", 
               values_drop_na = TRUE)

Element

  • Column element contains names of two different variables, that should each be in separate column.
  • Distinct from eg. m1524 in tuberculosis data, that contained levels of two different factors, handled by separate.
  • Untangling names of variables handled by pivot_wider.

Handling element

weather %>%
  pivot_longer(d01:d31, names_to="day", 
               values_to="temperature", 
               values_drop_na = TRUE) %>% 
  pivot_wider(names_from=element, 
                values_from=temperature) 

Further improvements 1/2

  • We have tidy data now, but can improve things further.
  • mutate creates new columns from old (or assign back to change a variable).
  • Would like numerical dates. separate works, or pull out number as below.
  • select keeps columns (or drops, with minus). Station name has no value to us.

Further improvements 2/2

weather %>%
  pivot_longer(d01:d31, names_to="day", 
               values_to="temperature", values_drop_na = TRUE) %>% 
  pivot_wider(names_from=element, values_from=temperature) %>% 
  mutate(Day = parse_number(day)) %>%
  select(-station)

Final step(s)

  • Make year-month-day into proper date.
  • Keep only date, tmax, tmin:
weather %>%
  pivot_longer(d01:d31, names_to="day", 
               values_to="temperature", values_drop_na = T) %>% 
  pivot_wider(names_from=element, values_from=temperature) %>% 
  mutate(Day = parse_number(day)) %>%
  select(-station) %>% 
  unite(datestr, c(Year, Month, Day), sep = "-") %>%
  mutate(date = as.Date(datestr)) %>%
  select(c(date, tmax, tmin)) -> weather_tidy

Our tidy data frame

weather_tidy

Plotting the temperatures

  • Plot temperature against date joined by lines, but with separate lines for max and min. ggplot requires something like
ggplot(..., aes(x = date, y = temperature)) + geom_point() + 
  geom_line()

only we have two temperatures, one a max and one a min, that we want to keep separate.

  • The trick: combine tmax and tmin together into one column, keeping track of what kind of temp they are. (This actually same format as untidy weather.) Are making weather_tidy untidy for purposes of drawing graph only.
  • Then can do something like
ggplot(d, aes(x = date, y = temperature, colour = maxmin)) 
  + geom_point() + geom_line()

to distinguish max and min on graph.

Setting up plot

  • Since we only need data frame for plot, we can do the column-creation and plot in a pipeline.
  • For a ggplot in a pipeline, the initial data frame is omitted, because it is whatever came out of the previous step.
  • To make those “one column”s: pivot_longer. I save the graph to show overleaf:
weather_tidy %>%
  pivot_longer(tmax:tmin, names_to="maxmin", 
               values_to="temperature") %>%
  ggplot(aes(x = date, y = temperature, colour = maxmin)) + geom_point() +
      geom_line() -> g

The plot

g

Summary of tidying “verbs”

Verb Purpose
pivot_longer Combine columns that measure same thing into one
pivot_wider Take column that measures one thing under different conditions and put into multiple columns
separate Turn a column that encodes several variables into several columns
unite Combine several (related) variables into one “combination” variable

pivot_longer and pivot_wider are opposites; separate and unite are opposites.