Packages for this section
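The package list from the original slide is not shown here; presumably it loads the tidyverse, which supplies read_delim, read_csv, pivot_longer, pivot_wider, str_c, separate_wider_position, unite, and ggplot used below:
library(tidyverse)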
Reading in the pig data
my_url <- "http://ritsokiguess.site/datafiles/pigs1.txt"
pigs1 <- read_delim(my_url, " ")
pigs1
Making it longer
We wanted all the weights in one column, labelled by which feed they went with.
This is a very common reorganization, and the magic “verb” is pivot_longer:
pigs1 %>% pivot_longer(feed1:feed4, names_to = "feed",
                       values_to = "weight") -> pigs2
pigs2
Alternatives
Any way of choosing the columns to pivot longer is good, eg:
pigs1 %>% pivot_longer(-pig, names_to = "feed",
                       values_to = "weight")
or
pigs1 %>% pivot_longer(starts_with("feed"), names_to = "feed",
                       values_to = "weight")
Identifying the pigs
Values in pig identify pigs within each group: pig 1 is four different pigs!
Create unique pig IDs by gluing pig number onto feed:
pigs2 %>% mutate(pig_id = str_c(feed, "_", pig)) -> pigs2
pigs2
…and finally, the analysis
which is just what we saw before:
weight.1 <- aov(weight ~ feed, data = pigs2)
summary(weight.1)
Df Sum Sq Mean Sq F value Pr(>F)
feed 3 3521 1173.5 119.1 3.72e-11 ***
Residuals 16 158 9.8
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The mean weights of pigs on the different feeds are definitely not all equal.
So we run Tukey to see which ones differ (over).
Tukey
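The output below is Tukey’s method applied to the fitted model; the call itself is not shown on the slide, but it would be something like:
TukeyHSD(weight.1)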
Tukey multiple comparisons of means
95% family-wise confidence level
Fit: aov(formula = weight ~ feed, data = pigs2)
$feed
diff lwr upr p adj
feed2-feed1 8.68 3.001038 14.358962 0.0024000
feed3-feed1 33.48 27.801038 39.158962 0.0000000
feed4-feed1 25.62 19.941038 31.298962 0.0000000
feed3-feed2 24.80 19.121038 30.478962 0.0000000
feed4-feed2 16.94 11.261038 22.618962 0.0000013
feed4-feed3 -7.86 -13.538962 -2.181038 0.0055599
All of the feeds differ!
Mean weights by feed
To find the best and worst, get mean weight by feed group. I borrowed an idea from earlier to put the means in descending order:
pigs2 %>%
  group_by(feed) %>%
  summarize(mean_weight = mean(weight)) %>%
  arrange(desc(mean_weight))
Feed 3 is best, feed 1 worst.
Should we have any concerns about the ANOVA?
ggplot(pigs2, aes(x = feed, y = weight)) + geom_boxplot()
Tuberculosis
The World Health Organization keeps track of the number of cases of various diseases, eg. tuberculosis.
Some data:
my_url <- "http://ritsokiguess.site/datafiles/tb.csv"
tb <- read_csv(my_url)
The data (randomly chosen rows)
tb %>% slice_sample(n = 10)
The full data frame has many more rows than this random sample of 10.
What we have
Variables: country (abbreviated), year. Then the number of cases for each gender and age group, eg. m1524 is males aged 15–24. Also mu and fu, where the age is unknown.
Lots of missing values, which we want to get rid of.
Abbreviations here.
tb %>%
  pivot_longer(m04:fu, names_to = "genage",
               values_to = "freq", values_drop_na = TRUE) -> tb2
Code for pivot_longer:
the columns to make longer (m04:fu)
names_to: the column to contain the names (categorical)
values_to: the column to contain the values (quantitative)
values_drop_na: drop missings in the values
Separating
4 columns, but 5 variables, since genage contains both gender and age group. Split that up using separate_wider_position, which needs 3 things:
the column to split (no quotes needed),
the widths of the pieces, named for the new columns they become (here you do need quotes): gender is the first character, age the next four characters,
what to do when a value is too short to fill all the pieces: too_few = "align_start" takes care of values like mu and fu.
tb2 %>% separate_wider_position(genage,
                                widths = c("gender" = 1, "age" = 4),
                                too_few = "align_start") -> tb3
tb3
Tidied tuberculosis data (some)
In practice…
instead of saving the result of each step (as we did above with tb2 and tb3), you debug the pipeline one step at a time, and when a step is working, you use its output as input to the next step, thus:
tb %>%
  pivot_longer(m04:fu, names_to = "genage",
               values_to = "freq", values_drop_na = TRUE) %>%
  separate_wider_position(genage,
                          widths = c("gender" = 1, "age" = 4),
                          too_few = "align_start")
You can split the R code over as many lines as you like, as long as each line is incomplete, so that R knows more is to come.
I like to put the pipe symbol at the end of the line.
Total tuberculosis cases by year (some of the years)
tb3 %>%
  filter(between(year, 1991, 1998)) %>%
  group_by(year) %>% summarize(total = sum(freq))
Something very interesting happened between 1994 and 1995.
To find out what, try counting up total cases by country:
tb3 %>% group_by(iso2) %>%
  summarize(total = sum(freq)) %>%
  arrange(desc(total))
What years do I have for China?
China started recording in 1995, which is at least part of the problem:
tb3 %>% filter(iso2 == "CN") %>%
  group_by(year) %>%
  summarize(total = sum(freq))
First year of recording by country?
A lot of countries started recording in about 1995, in fact:
tb3 %>% group_by(iso2) %>%
  summarize(first_year = min(year)) %>%
  count(first_year)
So the reason for the big jump in cases is that so many countries started recording then, not that there really were more cases.
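As a hedged check (not in the original slides), we could total cases by year using only the countries that were already recording by 1991, so that the same countries contribute to every year; if the jump is a recording artifact, these totals should change much more smoothly:
tb3 %>%
  group_by(iso2) %>%
  mutate(first_year = min(year)) %>%   # each country's first recorded year
  ungroup() %>%
  filter(first_year <= 1991, between(year, 1991, 1998)) %>%
  group_by(year) %>%
  summarize(total = sum(freq))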
Some Toronto weather data
my_url <-
"http://ritsokiguess.site/STAC32/toronto_weather.csv"
weather <- read_csv(my_url)
weather
Off we go
The numbers in the data frame are all temperatures (for different days of the month), so the first step is:
weather %>%
  pivot_longer(d01:d31, names_to = "day",
               values_to = "temperature",
               values_drop_na = TRUE)
Element
Column element contains the names of two different variables, which should each be in a separate column.
This is distinct from eg. m1524 in the tuberculosis data, which contained the levels of two different factors and was handled by separate_wider_position.
Untangling names of variables is handled by pivot_wider.
Handling element
weather %>%
  pivot_longer(d01:d31, names_to = "day",
               values_to = "temperature",
               values_drop_na = TRUE) %>%
  pivot_wider(names_from = element,
              values_from = temperature)
Further improvements 1/2
We have tidy data now, but can improve things further.
mutate creates new columns from old ones (or, assigning back to an existing name, changes a variable).
We would like the day of the month as a number: separating it out would work, or we can pull the number out with parse_number, as below.
select keeps columns (or drops them, with a minus sign). The station name has no value to us.
Further improvements 2/2
weather %>%
  pivot_longer(d01:d31, names_to = "day",
               values_to = "temperature", values_drop_na = TRUE) %>%
  pivot_wider(names_from = element, values_from = temperature) %>%
  mutate(Day = parse_number(day)) %>%
  select(-station)
Final step(s)
Make year-month-day into proper date.
Keep only date, tmax, tmin:
weather %>%
  pivot_longer(d01:d31, names_to = "day",
               values_to = "temperature", values_drop_na = TRUE) %>%
  pivot_wider(names_from = element, values_from = temperature) %>%
  mutate(Day = parse_number(day)) %>%
  select(-station) %>%
  unite(datestr, c(Year, Month, Day), sep = "-") %>%
  mutate(date = as.Date(datestr)) %>%
  select(c(date, tmax, tmin)) -> weather_tidy
Plotting the temperatures
Plot temperature against date joined by lines, but with separate lines for max and min.
ggplot requires something like
ggplot(..., aes(x = date, y = temperature)) + geom_point() +
  geom_line()
only we have two temperatures, one a max and one a min, that we want to keep separate.
The trick: combine tmax and tmin together into one column, keeping track of which kind of temperature each one is. (This is actually the same format as the untidy weather.) We are making weather_tidy untidy, but only for the purposes of drawing the graph.
Then we can do something like
ggplot(d, aes(x = date, y = temperature, colour = maxmin)) +
  geom_point() + geom_line()
to distinguish max and min on the graph.
Setting up plot
Since we only need the data frame for the plot, we can do the column-creation and the plot in one pipeline.
For a ggplot in a pipeline, the initial data frame is omitted, because it is whatever came out of the previous step.
To make those “one column”s: pivot_longer. I save the graph to show overleaf:
weather_tidy %>%
  pivot_longer(tmax:tmin, names_to = "maxmin",
               values_to = "temperature") %>%
  ggplot(aes(x = date, y = temperature, colour = maxmin)) + geom_point() +
  geom_line() -> g
The plot
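The slide here presumably just prints the saved graph object, which displays temperature against date with separately coloured points and lines for tmax and tmin:
g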
Summary of tidying “verbs”
pivot_longer: Combine columns that measure the same thing into one.
pivot_wider: Take a column that measures one thing under different conditions and put it into multiple columns.
separate: Turn a column that encodes several variables into several columns.
unite: Combine several (related) variables into one “combination” variable.
pivot_longer and pivot_wider are opposites; separate and unite are opposites (see the small illustration below).
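As a small made-up illustration (not in the original slides) of pivot_longer and pivot_wider being opposites, pivoting a tiny data frame longer and then wider again gives back what we started with:
d <- tibble(id = 1:2, a = c(10, 20), b = c(30, 40))
d %>%
  pivot_longer(a:b, names_to = "name", values_to = "value") %>%
  pivot_wider(names_from = name, values_from = value)
# back to the original d: columns id, a, b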
Comments
pigs2 is now in “long” format, ready for analysis.