STAC32 Assignment 7
Packages
library(tidyverse)
Canadian cities
For the 25 biggest cities in Canada, the population was recorded in each of 1992 through 1996 (in thousands). These data can be found in http://ritsokiguess.site/datafiles/canada_cities.csv. In addition, each of those cities was classified as belonging to a region of the country: Ontario, Quebec, West (BC or Alberta), Prairies (Saskatchewan or Manitoba), or Atlantic (Nova Scotia, New Brunswick, PEI, or Newfoundland). These data are in http://ritsokiguess.site/datafiles/canada_regions.csv.
- (2 points) Read in and display (some of) both the cities data and the regions data.
Twice as much as the usual (easy) work:
my_url <- "http://ritsokiguess.site/datafiles/canada_cities.csv"
cities <- read_csv(my_url)
Rows: 125 Columns: 3
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (1): city
dbl (2): year, population
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
my_url <- "http://ritsokiguess.site/datafiles/canada_regions.csv"
regions <- read_csv(my_url)
Rows: 25 Columns: 2
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (2): city, region
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
cities
regions
Extra 1: this is the cities data from the DAAG package. These data originally came wide rather than long, but I wanted a problem where you had to use pivot_wider, so I made them long so that you would have to convert them back (later):
city0 <- DAAG::cities
# library(DAAG)
# data(cities)
# cities
This is how I made the datasets I saved for you. See below for notes.
city0 %>% janitor::clean_names() -> city0
city0
city0 %>% select(city, region) -> regions
regions
and then
city0 %>% select(-region) %>%
pivot_longer(-city, names_to = "year",
values_to = "population") %>%
mutate(year = parse_number(year)) -> cities_long
cities_long
Notes:
- the original data set had the column names in uppercase.
clean_names from the janitor package will convert them all to lowercase, along with doing other tidying such as replacing spaces in column names by underscores, and putting an X on the front of column names that start with a number. The guarantee is that all the columns will have R-legal column names after you have run clean_names.
- I decided to get rid of the regions and put them in a separate table, for reasons I explain below.
- The years seemed to make more sense as numbers rather than a name like pop1996, so I pulled out just the year as a number (parse_number takes a piece of text and pulls out the first number it finds in it). See the small sketch after this list.
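To see what these two functions do, here is a minimal sketch on a tiny made-up dataframe (not part of the assignment data; it assumes the janitor package is installed):
demo <- tibble(`CITY` = "Toronto", `POP 1996` = 4264, `96 RANK` = 1)
demo %>% janitor::clean_names()
# column names become city, pop_1996, x96_rank
parse_number("pop1996")
# pulls out the number 1996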
Extra 2:
If you make the data long without removing the regions, this is what happens:
city0 %>%
pivot_longer(starts_with("pop"), names_to = "year", values_to = "population") %>%
mutate(year = parse_number(year))
The regions get repeated, because the cities got repeated. There is redundancy here; if Toronto is in Ontario once, it will be every time. So you really only need each city once along with its region. This is why I made a second table of regions. The idea is that with the dataframe above, if you decide to reorganize the regions, you have to remember to change the regions everywhere they appear (which will be more than once), and you increase your chances of making a mistake. With a separate table of regions, you only have to change a region once for each city.
This is the extra part of “tidy”: each type of observational unit is in its own table. Database people call this “normal form”, and when you have a large number of related tables (dataframes) in a database, you need to keep this straight, or else things can get out of hand very quickly.
- (4 points) Create a table containing the cities and their regions, with a separate column for each year.
The “separate column for each year” suggests pivoting wider:
cities %>% pivot_wider(names_from = year, values_from = population)
but this doesn't have the region yet, which is in a separate dataframe, so we need to left-join it onto this:
cities %>% pivot_wider(names_from = year, values_from = population) %>%
left_join(regions, join_by(city)) -> cities_wide
cities_wide
and now we have the region for each city appended onto the dataframe. The join_by is optional, but including it makes it clear that you are matching up the city column that both dataframes have.
There’s one more question here, which suggests that it would be a good idea to save this result.
- (3 points) Find the fourth largest city in the west region, by 1996 population (without displaying the first through third largest city in the west). Describe your strategy briefly. Hint: to work with a column whose name starts with a number, surround the column name with backticks (on the key to the left of 1 on your keyboard).
Two ways that I can immediately think of. The first, and probably more obvious, is this:
- grab only the cities in the west region,
- sort them into descending order by 1996 population (the column is called 1996),
- then show the 4th one.
cities_wide %>%
filter(region == "WEST") %>%
arrange(desc(`1996`)) %>%
slice(4)
giving an answer of Victoria (the one in BC).
The second, less obvious, way takes advantage of the fact that sorting works within groups, so you can group by region first. This strategy, then, is
- group by region
- sort in descending order by 1996 population (within region)
- display the 4th largest city (in all the regions that have one):
cities_wide %>%
group_by(region) %>%
arrange(desc(`1996`)) %>%
slice(4)
which displays the fourth largest city in each region, and hence the answer you want is once again Victoria. You can check that, apart from Ontario, Quebec, and the West, the other regions don't have four cities (in this dataframe, at least).
Note that the question did not forbid you from displaying the fourth largest city in other regions as well, so this is a full-credit answer. (If you thought of this way, I didn’t want to stop you from using it: group-by is surprisingly helpful.)
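If you did want to see only the West row from the grouped version, one option (a sketch, not something the question requires) is to add a filter at the end:
cities_wide %>%
  group_by(region) %>%
  arrange(desc(`1996`)) %>%
  slice(4) %>%
  filter(region == "WEST")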
You can talk about your strategy first, and then show the code that carries out your strategy, or you can show the code first and then talk about what it is doing. But you need to have code plus some description of what you are doing, in some order.
Any way of doing it, possibly different from these, is fine, as long as it works and you can explain why it works.
Extra: you might think of using slice_max in a situation like this, to save doing the sorting, but:
cities_wide %>%
filter(region == "WEST") %>%
slice_max(`1996`, n = 4)
this displays all of the first through fourth largest cities in the west, which the question asked you to not do.
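If you like the slice_max idea, one possible rescue (a sketch, not something the question asks for) is to follow it with slice_min, which keeps only the smallest of those top four 1996 populations, namely the fourth largest:
cities_wide %>%
  filter(region == "WEST") %>%
  slice_max(`1996`, n = 4) %>%
  slice_min(`1996`, n = 1)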
Miscellaneous tidying
The questions below require you to read a datafile and to produce a new version of it that is in the format specified.
- (2 points) The data in http://ritsokiguess.site/datafiles/eagle_pairs.csv show the number of nesting pairs of eagles observed in each (US) state for a number of years. Your task is to produce a dataframe with each row containing the number of nesting pairs in one state in one year. You will probably see some missing values in your answer; for this question, you may leave them. Hints (you might find either or both of them helpful): (i) when you use a select-helper, you can put a ! before it to select all the columns except those that match the select-helper; (ii) if you want to refer to a column whose name is (or starts with) a number, you have to surround the column name with backticks, for example `22x`.
Read the data in first, and see what you have:
my_url <- "http://ritsokiguess.site/datafiles/eagle_pairs.csv"
eagle_pairs <- read_csv(my_url)
Rows: 48 Columns: 12
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (2): state, state_abbr
dbl (10): 1997, 1998, 1999, 2000, 2001, 2002, 2003, 2004, 2005, 2006
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
eagle_pairs
You want to pivot all the columns from 1997 onwards longer, to get one observation per row. (The years go up to 2006.)
The obvious way is to refer to all the years you want to pivot longer. To do this, use the second hint, but it is otherwise an ordinary pivot-longer:
eagle_pairs %>% pivot_longer(`1997`:`2006`, names_to = "year", values_to = "pairs")
The first hint is how you take advantage of the columns you don't want to pivot-longer being the ones starting with state:
eagle_pairs %>% pivot_longer(!starts_with("state"), names_to = "year", values_to = "pairs")
Either way is good. In pivot_longer, the column names you give to names_to and values_to must be in quotes, because you are creating these columns (they don't exist yet). Any reasonable names for these columns are fine.
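Another possibility (a sketch, not required) is to select the columns to pivot by type, since the year columns are the only numeric ones in this dataframe:
eagle_pairs %>%
  pivot_longer(where(is.numeric), names_to = "year", values_to = "pairs")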
Extra: you might have noticed some redundancy in that the state abbreviations have been repeated as well, and (for example) AL is always going to go with Alabama. Probably the right way to deal with this is to have a separate dataframe of state names and state abbreviations, and just have one of those columns here. This gets into issues of “normal form”, which I discuss elsewhere (in the question about Canadian cities, where I also have the region of Canada of each city).
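If you wanted to pursue that, a minimal sketch of such a lookup table (using the column names in this dataframe) is to pull out the distinct state and abbreviation pairs:
eagle_pairs %>% distinct(state, state_abbr)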
Extra extra: these are the example_eagle_pairs data from the dcldata package.
- (3 points) In the Olympic Games, the sport of gymnastics has team events, where five gymnasts from each country compete in the vault and the floor event, and their scores out of 10 are added together to produce a team score. There are four team scores at each Olympic Games: vault and floor, for each of male and female gymnasts. The data in http://ritsokiguess.site/datafiles/gymnastics3.csv contain scores for three different teams in each of the two events at two different Olympic Games, for each of the men’s and women’s teams. Your task is to produce a dataframe with one column of scores, with additional columns labelling the event, Olympic Games and gender that the score belongs to.
First, read in and take a look:
my_url <- "http://ritsokiguess.site/datafiles/gymnastics3.csv"
gymnastics <- read_csv(my_url)
Rows: 3 Columns: 9
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (1): country
dbl (8): vault_2012_f, vault_2012_m, vault_2016_f, vault_2016_m, floor_2012_...
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
gymnastics
This is very wide: the scores are scattered over 8 columns. So we'll need some kind of pivot-longer. The extra thing this time is that the column names we have here actually encode three things: an event, a year, and a gender. The best way to handle this is to do two things: (i) to use more than one column in names_to, and (ii) to use names_sep to say what those pieces of the column names are separated by in the dataframe we have (an underscore):
gymnastics %>%
pivot_longer(-country, names_to = c("event", "year", "gender"),
names_sep = "_", values_to = "score")That’s a three-point answer.
If you don’t see that, do a regular pivot-longer and then figure out what to do next, something like this:
gymnastics %>%
pivot_longer(-country, names_to = "name", values_to = "score")Now you see that the values in name actually encode three things: an event, a year, and a gender, so you can use separate_wider to split them up:
gymnastics %>%
pivot_longer(-country, names_to = "name", values_to = "score") %>%
separate_wider_delim(name, delim = "_", names = c("event", "year", "gender"))
This is only two points, because you could have done it in a more direct way (the first way).
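Yet another route, if you know regular expressions, is names_pattern instead of names_sep (a sketch; each group in the pattern captures one piece of the column name between the underscores):
gymnastics %>%
  pivot_longer(-country, names_to = c("event", "year", "gender"),
               names_pattern = "([^_]+)_([^_]+)_([^_]+)", values_to = "score")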
Extra: these are the example_gymnastics_3 data from the dcldata package, used as is. If you want to investigate, the installation of dcldata is like the installation of smmr:
# install.packages("devtools")
devtools::install_github("dcl-docs/dcldata")
- (4 points) The data in http://ritsokiguess.site/datafiles/children.csv are the names of up to two children in each of six families, along with their dates of birth. The birthdate numbered 1 goes with the child numbered 1 in that family. One of the families has only one child. Your task is to produce a dataframe with only one child in each row: their name, along with their date of birth and the number of the family they belong to.
Read in the data first:
my_url <- "http://ritsokiguess.site/datafiles/children.csv"
child_names <- read_csv(my_url)
Rows: 6 Columns: 5
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (2): name_1, name_2
dbl (1): family
date (2): dob_1, dob_2
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
child_names
There are six families. One of the families (number 3) has only one child, as evidenced by the missing value for the name and date of birth of the second child.
Most of the rows have two children on them, so we need to pivot-longer. The column names that we need to pivot-longer have two pieces of information in them: what they are, and what number child they apply to. So we will need the variant of pivot_longer with two things in the names_to.
The second thing to observe is that dob and name need to be names of new columns, so we have to use the .value thing as the first of the names_to:
child_names %>%
pivot_longer(-family, names_to = c(".value", "child_number"), names_sep = "_")Once you see how this goes, it will probably seem almost magic.
This has created columns with the right names, but there is one row of missing values (corresponding to the lack of a second child in family 3), so we finish by getting rid of this row:
child_names %>%
pivot_longer(-family, names_to = c(".value", "number"), names_sep = "_") %>%
drop_na()
A second option is to drop the NAs directly in the pivot_longer, as we did with the tuberculosis data in lecture:
child_names %>%
pivot_longer(-family, names_to = c(".value", "child_number"),
names_sep = "_", values_drop_na = TRUE) You should finish with a dataframe with 11 rows and no missing values. Most of the families now occupy two rows.
Extra: I got the names from students enrolled in this class, and made up some birth dates (so these are made-up data, but typical of the kind of thing you would run into). You can use what is essentially a CSV file, but typed as a piece of text into a variable:
names_txt <- "
family,dob_1,dob_2,name_1,name_2
1,2003-11-13,2006-09-03,Maheen,Muhammad
2,2004-03-01,2005-11-02,Zhao,Chen
3,2006-08-12,,Alex,
4,2002-03-11,2007-01-04,Andrew,Kayleigh
5,2005-12-07,2008-03-22,Jiahui,Zheng
6,2004-07-22,2007-09-12,Mehar,Ayesha
"and then I read this in as if it were a CSV file:
names <- read_csv(names_txt)
Rows: 6 Columns: 5
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (2): name_1, name_2
dbl (1): family
date (2): dob_1, dob_2
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
names
and this is what I saved for you. I hope I have managed to get the families with two children to have their names be of the same ethnicity.
An additional note: these birth dates were recorded as year-month-day, so read_csv will read them in as actual dates (rather than as pieces of text that look like dates). When we get to D29, we'll see how to deal with dates written some other way (Americans like to start with the month), and how to do calculations with dates (like seeing how old the first child was when the second one was born).
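As a small preview of that (a sketch only; it works here because read_csv has already turned the dob columns into dates), subtracting two date columns gives the gap between them in days:
child_names %>%
  mutate(gap_days = as.numeric(dob_2 - dob_1))
The family with only one child gets a missing value for the gap.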
Gases
- (3 points) The data in http://ritsokiguess.site/datafiles/gases1.csv are measurements of volume of a gas (in the column value) at different temperatures and pressures, measured in suitable units. Your task is to produce a dataframe with four rows, with one column for the low temperatures and one for the high ones.
First, of course, read the data and see what you have:
my_url <- "http://ritsokiguess.site/datafiles/gases1.csv"
gases <- read_csv(my_url)
Rows: 8 Columns: 4
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (2): temperature, pressure
dbl (2): id, value
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
gases
There are eight rows right now, and we want to reduce it to four, which means making things wider.
This is a standard pivot-wider. We want each temperature in a separate column, so temperature goes in the names_from, and the volumes in value go in the values_from:
gases %>% pivot_wider(names_from = temperature, values_from = value)
With this arrangement of the data, the low and hi in the column names refer to temperature; whether the pressure is low or high is still in the column pressure.
Extra 1: these are made-up data (sorry), but typical of what you might meet. I started from here:
data_txt <- "
id,temperature,pressure,value
100,low,low,67
100,low,hi,71
100,hi,low,66
100,hi,hi,70
101,low,low,77
101,low,hi,81
101,hi,low,76
101,hi,hi,80
"and read this into a dataframe:
gases0 <- read_csv(data_txt)
Rows: 8 Columns: 4
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (2): temperature, pressure
dbl (2): id, value
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
gases0
which I then saved for you. The dataframe used in the next question was obtained by adding a column to this one.
Extra 2: if you throw two columns into names_from, this happens:
gases %>% pivot_wider(names_from = c(temperature, pressure), values_from = value)
The column names now have two pieces, each low or hi: the first piece is temperature and the second is pressure. The only thing remaining in the rows is the two different id values (that presumably denote different experimental runs or something like that).
Extra 3: If you have studied chemistry, you might recall the Ideal Gas Law, which relates pressure, temperature, and volume for a particular gas. The Law says that \(PV = nRT\), where \(P\) is pressure, \(V\) is volume, \(n\) is the number of moles of the gas, \(R\) is a constant, and \(T\) is temperature. I have no idea whether my data get anywhere close to obeying the Ideal Gas Law. I first came up with the idea of having low and high values of temperature and pressure, and the Ideal Gas Law says that volume should depend on these, so that was the variable I claimed to have measured. As I say, the sort of thing you might have measured, but not actually real data.
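Rearranging the Law (just basic algebra, to connect it to the setup here): for a fixed amount of gas \(n\), we get \(V = nRT/P\), so volume should go up with temperature and down with pressure, which is why volume made sense as the variable to (pretend to) measure.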
- (3 points) The scenario for this question is the same as for Question 7, except that the data, in http://ritsokiguess.site/datafiles/gases2.csv, now have an additional column called value_2. In this case, we are interested in the variable value_2 rather than value. Using pivoting code analogous to that of the previous question, but without otherwise changing the dataframe in any way, explain what has happened and why it happened.
Several things to do here. Start by reading in this dataframe:
my_url <- "http://ritsokiguess.site/datafiles/gases2.csv"
gases2 <- read_csv(my_url)
Rows: 8 Columns: 5
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (2): temperature, pressure
dbl (3): id, value, value_2
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
gases2
This has an extra column value_2 as promised, and is otherwise identical to the dataframe of the previous question.
Second: we are now interested in value_2 rather than value, so copy your pivot-wider code from the previous question and make two changes:
- change the name of the dataframe to gases2 (or whatever you called your second dataframe)
- change the input to values_from to value_2:
gases2 %>% pivot_wider(names_from = temperature, values_from = value_2)
We have gained some missing values, and we have the same eight rows that the original dataframe did.
To explain what happened: to determine which row an observation goes into, pivot_wider looks at all the columns not involved in the pivot: id, pressure, and value. Each distinct combination of these three variables gets its own row, but the values in value are all different within each combination of id and pressure, so there are eight combinations and thus still eight rows, with missings appearing wherever a low or hi temperature was not observed for that combination of id, pressure, and value.
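A quick way to see this (a sketch, not needed for the answer): count the combinations of the columns not being pivoted, and note that each of the eight appears exactly once:
gases2 %>% count(id, pressure, value)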
- (2 points) How can you fix things in Question 8 so that the result looks similar to that of Question 7 (for this, you are allowed to do something with the dataframe)?
The problem was the column value, which we are no longer interested in, so the fix is to get rid of that column first, before pivoting wider:
gases2 %>%
select(-value) %>%
pivot_wider(names_from = temperature, values_from = value_2)
and now it works fine, with the numbers in the columns low and hi being values of value_2.
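Another possibility (a sketch, using pivot_wider's id_cols argument rather than changing the dataframe) is to say explicitly which columns identify a row, so that value gets ignored:
gases2 %>%
  pivot_wider(id_cols = c(id, pressure),
              names_from = temperature, values_from = value_2)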
Points: \((2+4+3) + (2+3+4) + (3+3+2) = 9+9+8 = 26\).