Tidying data

Data rarely come to us as we want to use them.
Before we can do analysis, typically have organizing to do.
This is typical of ANOVA-type data, “wide format”:

     pig feed1 feed2 feed3 feed4
       1  60.8  68.7  92.6  87.9
       2  57.0  67.7  92.1  84.2
       3  65.0  74.0  90.2  83.1
       4  58.6  66.3  96.5  85.7
       5  61.7  69.8  99.1  90.3

20 pigs randomly allocated to one of four feeds. At end of study, weight of each pig is recorded.
Are any differences in mean weights among the feeds?
Problem: want all weights in one column, with 2nd column labelling which feed. Untidy!

Tidy and untidy data (Wickham)

Data set easier to deal with if:
- each observation is one row
- each variable is one column
- each type of observation unit is one table
Data arranged this way called “tidy”; otherwise called “untidy”.
For the pig data:
- response variable is weight, but scattered over 4 columns, which are levels of a factor feed.
- Want all the weights in one column, with a second column feed saying which feed that weight goes with.
- Then we can run aov.

Packages for this section

library(tidyverse)

Reading in the pig data

my_url <- "http://ritsokiguess.site/datafiles/pigs1.txt"
pigs1 <- read_delim(my_url, " ")
pigs1

Making it longer

We wanted all the weights in one column, labelled by which feed they went with.
This is a very common reorganization, and the magic “verb” is pivot_longer:

pigs1 %>% pivot_longer(feed1:feed4, names_to="feed", 
                       values_to="weight") -> pigs2

The long dataframe `pigs2`

Alternatives

Any way of choosing the columns to pivot longer is good, eg:

pigs1 %>% pivot_longer(-pig, names_to="feed", 
                       values_to="weight")

pigs1 %>% pivot_longer(starts_with("feed"), names_to="feed", 
                       values_to="weight")

Comments

pigs2 now in “long” format, ready for analysis.
Anatomy of pivot_longer:
- columns to combine
- a name for column that will contain groups (“names”)
- a name for column that will contain measurements (“values”)

Identifying the pigs

Values in pig identify pigs within each group: pig 1 is four different pigs!
Create unique pig IDs by gluing pig number onto feed:

pigs2 %>% mutate(pig_id=str_c(feed, "_", pig)) -> pigs2

The new `pigs2`

…and finally, the analysis

which is just what we saw before:

weight.1 <- aov(weight ~ feed, data = pigs2)
summary(weight.1)

            Df Sum Sq Mean Sq F value   Pr(>F)    
feed         3   3521  1173.5   119.1 3.72e-11 ***
Residuals   16    158     9.8                     
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The mean weights of pigs on the different feeds are definitely not all equal.
So we run Tukey to see which ones differ (over).

Tukey

TukeyHSD(weight.1)

  Tukey multiple comparisons of means
    95% family-wise confidence level

Fit: aov(formula = weight ~ feed, data = pigs2)

$feed
             diff        lwr       upr     p adj
feed2-feed1  8.68   3.001038 14.358962 0.0024000
feed3-feed1 33.48  27.801038 39.158962 0.0000000
feed4-feed1 25.62  19.941038 31.298962 0.0000000
feed3-feed2 24.80  19.121038 30.478962 0.0000000
feed4-feed2 16.94  11.261038 22.618962 0.0000013
feed4-feed3 -7.86 -13.538962 -2.181038 0.0055599

All of the feeds differ!

Mean weights by feed

To find the best and worst, get mean weight by feed group. I borrowed an idea from earlier to put the means in descending order:

pigs2 %>%
  group_by(feed) %>%
  summarize(mean_weight = mean(weight))%>%
  arrange(desc(mean_weight))

Feed 3 is best, feed 1 worst.

Should we have any concerns about the ANOVA?

ggplot(pigs2, aes(x = feed, y = weight)) + geom_boxplot()

Comments

Feed 2 has an outlier
But there are only 5 pigs in each group
The conclusion is so clear that I am OK with this.

Tuberculosis

The World Health Organization keeps track of number of cases of various diseases, eg. tuberculosis.
Some data:

my_url <- "http://ritsokiguess.site/datafiles/tb.csv"
tb <- read_csv(my_url)

The data (10 randomly chosen rows)

tb %>% slice_sample(n = 10)

Many rows and columns

nrow(tb)

[1] 5769

ncol(tb)

[1] 22

What we have

Variables: country (abbreviated), year. Then number of cases for each gender and age group, eg. m1524 is males aged 15–24. Also mu and fu, where age is unknown.
Lots of missings. Want to get rid of.
Abbreviations here.

tb %>% 
  pivot_longer(m04:fu, names_to = "genage", 
               values_to = "freq", values_drop_na = TRUE) -> tb2

Code for pivot_longer:
- columns to make longer
- column to contain the names (categorical)
- column to contain the values (quantitative)
- drop missings in the values

Results (some)

tb2

Separating

4 columns, but 5 variables, since genage contains both gender and age group. Split that up using separate.
separate needs to know:
- what to separate (no quotes needed),
- how to split, and what to separate into (here you do need quotes):

tb2 %>% 
  separate_wider_position(genage, 
                          widths = c("gender" = 1, "age" = 4), 
                          too_few = "align_start") -> tb3

Tidied tuberculosis data (some)

tb3

In practice…

instead of doing the pipe one step at a time, you debug it one step at a time, and when you have each step working, you use that step’s output as input to the next step, thus:

tb %>%
  pivot_longer(m04:fu, names_to = "genage", 
               values_to = "freq", values_drop_na = TRUE) %>% 
  separate_wider_position(genage, 
                          widths = c("gender" = 1, "age" = 4), 
                          too_few = "align_start") -> tb_tidy

When you have it working, save the final result (for further work).

Comments

You can split the R code over as many lines as you like, as long as each line is incomplete, so that R knows more is to come.
I like to put the pipe symbol on the end of the line.
Sometimes one function call gets very long, in which case you can separate at commas.

Total tuberculosis cases by year (some of the years)

tb_tidy %>%
  filter(between(year, 1991, 1998)) %>% 
  group_by(year) %>% summarize(total=sum(freq))

Something very interesting happened between 1994 and 1995.

To find out what

try counting up total cases by country:

tb_tidy %>% group_by(iso2) %>% 
  summarize(total=sum(freq)) %>% 
  arrange(desc(total))

What years do I have for China?

China started recording in 1995, which is at least part of the problem:

tb_tidy %>% filter(iso2 == "CN") %>% 
  group_by(year) %>% 
  summarize(total = sum(freq))

First year of recording by country?

A lot of countries started recording in about 1995, in fact:

tb_tidy %>% group_by(iso2) %>% 
  summarize(first_year=min(year)) %>% 
  count(first_year)

Comment

So the reason for the big jump in cases is that so many countries started recording then, not that there really were more cases.

Some Toronto weather data

my_url <- "http://ritsokiguess.site/STAC32/toronto_weather.csv"
weather <- read_csv(my_url)
weather

The columns

Daily weather records for “Toronto City” weather station in 2018:
- station: identifier for this weather station (always same here)
- Year, Month
- element: whether temperature given was daily max or daily min
- d01, d02,… d31: day of the month from 1st to 31st.

Off we go

Numbers in data frame all temperatures (for different days of the month), so first step is

weather %>% 
  pivot_longer(d01:d31, names_to="day", 
               values_to="temperature", 
               values_drop_na = TRUE)

`Element`

Column element contains names of two different variables, that should each be in separate column.
Distinct from eg. m1524 in tuberculosis data, that contained levels of two different factors, handled by separate.
Untangling names of variables handled by pivot_wider.

Handling `element`

weather %>%
  pivot_longer(d01:d31, names_to="day", 
               values_to="temperature", 
               values_drop_na = TRUE) %>% 
  pivot_wider(names_from=element, 
                values_from=temperature)

Further improvements 1/2

We have tidy data now, but can improve things further.
mutate creates new columns from old (or assign back to change a variable).
Would like numerical dates. separate works, or pull out number as below.
select keeps columns (or drops, with minus). Station name has no value to us.

Further improvements 2/2

weather %>%
  pivot_longer(d01:d31, names_to="day", 
               values_to="temperature", values_drop_na = TRUE) %>% 
  pivot_wider(names_from=element, values_from=temperature) %>% 
  mutate(Day = parse_number(day)) %>%
  select(-station)

Final step(s)

Make year-month-day into proper date.
Keep only date, tmax, tmin:

weather %>%
  pivot_longer(d01:d31, names_to="day", 
               values_to="temperature", values_drop_na = T) %>% 
  pivot_wider(names_from=element, values_from=temperature) %>% 
  mutate(Day = parse_number(day)) %>%
  select(-station) %>% 
  unite(datestr, c(Year, Month, Day), sep = "-") %>%
  mutate(date = as.Date(datestr)) %>%
  select(c(date, tmax, tmin)) -> weather_tidy

Our tidy data frame

weather_tidy

Plotting the temperatures

Plot temperature against date joined by lines, but with separate lines for max and min. ggplot requires something like

ggplot(..., aes(x = date, y = temperature)) + geom_point() + 
  geom_line()

only we have two temperatures, one a max and one a min, that we want to keep separate.

The trick: combine tmax and tmin together into one column, keeping track of what kind of temp they are. (This actually same format as untidy weather.) Are making weather_tidy untidy for purposes of drawing graph only.
Then can do something like

ggplot(d, aes(x = date, y = temperature, colour = maxmin)) 
  + geom_point() + geom_line()

to distinguish max and min on graph.

Setting up plot

Since we only need data frame for plot, we can do the column-creation and plot in a pipeline.
For a ggplot in a pipeline, the initial data frame is omitted, because it is whatever came out of the previous step.
To make those “one column”s: pivot_longer. I save the graph to show overleaf:

weather_tidy %>%
  pivot_longer(tmax:tmin, names_to="maxmin", 
               values_to="temperature") %>%
  ggplot(aes(x = date, y = temperature, colour = maxmin)) + 
      geom_point() + geom_line() -> g

The plot

Summary of tidying “verbs”

Verb	Purpose
`pivot_longer`	Combine columns that measure same thing into one
`pivot_wider`	Take column that measures one thing under different conditions and put into multiple columns
`separate`	Turn a column that encodes several variables into several columns
`unite`	Combine several (related) variables into one “combination” variable

pivot_longer and pivot_wider are opposites; separate and unite are opposites.

Tidying data

Tidying data

Tidy and untidy data (Wickham)

Packages for this section

Reading in the pig data

Making it longer

The long dataframe pigs2

Alternatives

Comments

Identifying the pigs

The new pigs2

…and finally, the analysis

Tukey

Mean weights by feed

Should we have any concerns about the ANOVA?

Comments

Tuberculosis

The data (10 randomly chosen rows)

Many rows and columns

What we have

Results (some)

Separating

Tidied tuberculosis data (some)

In practice…

Comments

Total tuberculosis cases by year (some of the years)

To find out what

What years do I have for China?

First year of recording by country?

Comment

Some Toronto weather data

The columns

Off we go

Element

Handling element

Further improvements 1/2

Further improvements 2/2

Final step(s)

Our tidy data frame

Plotting the temperatures

Setting up plot

The plot

Summary of tidying “verbs”

The long dataframe `pigs2`

The new `pigs2`

`Element`

Handling `element`