Questions are below. My solutions will be available after the tutorials are all finished. The whole point of these worksheets is for you to use your lecture notes to figure out what to do. In tutorial, the TAs are available to guide you if you get stuck. Once you have figured out how to do this worksheet, you will be prepared to tackle the assignment that depends on it.
If you are not able to finish in an hour, I encourage you to continue later with what you were unable to finish in tutorial.
Packages
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ readr 2.1.5
✔ forcats 1.0.1 ✔ stringr 1.6.0
✔ ggplot2 4.0.0 ✔ tibble 3.3.0
✔ lubridate 1.9.4 ✔ tidyr 1.3.1
✔ purrr 1.2.0
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
Chick weights
An experiment was carried out on the effect of diet on the early growth of chicks. 50 chicks were randomly allocated to one of four different diets, and each chick's weight was measured (in grams) at various times after birth. The data in http://ritsokiguess.site/datafiles/chick_weights.csv contain identifiers for the chick and for the diet that the chick was on (a chick was on the same diet all the way through the experiment). There are a lot of columns: the columns with names like weight_1 are the weight of the chick in that row at the time point shown (time point 1 in this case), and the columns with names like Time_2 are the time, in days since birth, that the time point shown corresponds to (time point 2 in this case).
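A minimal sketch of the read-in (assuming we save the dataframe as chick_weights, the name used below):
# read the chick weights from the URL above
my_url <- "http://ritsokiguess.site/datafiles/chick_weights.csv"
chick_weights <- read_csv(my_url)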
Rows: 50 Columns: 26
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (1): Diet
dbl (25): Chick, weight_1, weight_2, weight_3, weight_4, weight_5, weight_6,...
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
chick_weights
There are 50 rows, one per chick, with all the weights (at different time points) and all the times of weighing for that chick all along one rather wide row.
Extra: there is a considerable backstory here, as you might expect.
These data come from the built-in dataset ChickWeight (not to be confused with chickwts that we saw before):
ChickWeight %>% as_tibble() -> cw
cw
This, as you see, is more or less where you ended up after tidying, so I had to deliberately untidy it to give you something to tidy. (The original is in some funky format, so I made it into a dataframe.)
First, I wanted to identify the different times the weights were measured at. count is a handy way to do that (with the odd effect that I count up how many times each value of Time appears, and then throw away the counts!)
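A sketch of how that might look (times is my name for the result, and rep numbers the time points, matching the column described below):
# one row per distinct time, numbered in order; the counts n are discarded
cw %>%
  count(Time) %>%
  mutate(rep = row_number()) %>%
  select(-n) -> times
times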
The reason for this is that I wanted the columns of the dataframe I gave you to have names like Time_1 and weight_2, meaning the first time a chick was weighed, and the weight of a chick at their second weighing, respectively. I also realized that I needed to have two quantitative variables to make this kind of problem work, and that Time and weight would serve this purpose here.
Next, I look up the times in my dataframe of times I just made:
cw %>% left_join(times, join_by(Time)) -> cw1
cw1
Now, I have a column (that I called rep) that indexes the weights and times. For example, for chick number 1, its third weighing (when rep is 3) was 4 days after birth, and at that time, it weighed 59 grams.
Now I have the machinery I need to make columns with the right names. This is a pivot-wider, but an odd one with two things in values_from (which is thus the inverse of the pivot-longer that I am making you do later):
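A sketch of that pivot_wider, including the extra mutate on Diet that is explained just below:
cw1 %>%
  mutate(Diet = paste0("D", Diet)) %>%  # put a D on the front of the diet numbers (see below)
  pivot_wider(names_from = rep, values_from = c(weight, Time)) -> chick_weights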
What this variation on pivot_wider does is to combine the values in names_from with the names of the columns in values_from[1] to make the new column names. It has to do something like this, because if it just called a column 1, you wouldn't know whether it was the first weight or the first time.
This is the dataframe I saved for you. However, there was one more thing I needed to do, which is the reason for that extra mutate. The first time I went through this problem, I didn’t change Diet, and so those were still numbers. I got all the way to the spaghetti plot at the end, and when I tried to make the diets different colours, ggplot thought that Diet was quantitative, and the way it handles colours for quantitative variables is that it uses a continuous scale of shades of blue. To fake up what happened to me:
ggplot(cw1, aes(x = Time, y = weight, colour = as.numeric(Diet), group = Chick)) +
  geom_point() + geom_line()
That makes the diets just a bit difficult to tell apart!
(The principle is that to represent a number by a colour, you have to be able to represent things like 3.5 as well as whole numbers.)
Having realized why this was happening, I then went back and put a D on the front of the diet names (so that they became D1 through D4), and then they would be guaranteed categorical when the time came. Then, everything works. (In the original data, Diet is stored as a factor that happens to have numerical levels,[2] so everything works, but when I save it and you read it back in, the diets are read in as numbers.)
Rearrange the data to have a column containing all the weights and a column containing the times (in days since birth) that those weights were measured, identified by the Chick and the Diet the chick was on. Save your new dataframe.
This is a pivot-longer, but one of the difficult ones. The columns we’re going to pivot-longer have names like weight_2 or Time_3; the second bit, after the underscore, is a number that’s going into a column (that I’m going to call rep, but you can call it whatever makes sense to you), but the first bit of these column names is going to become the name of a column, as in “a hairy one” in lecture where the variables we measured were called y and z and they needed to end up in their own columns. The columns we want to pivot-longer all have an underscore in their names, so this will do it:
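A sketch, using rep for the second piece of each column name:
chick_weights %>%
  pivot_longer(contains("_"),
               names_to = c(".value", "rep"),
               names_sep = "_") -> chick_weights_long
chick_weights_long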
Now we have a lot of rows: 50 chicks times 12 measurements on each one, some of which are actually missing. The rep column has probably served its purpose (it tells you, for example, that the first chick, which was on diet D1, weighed 125 grams at its 8th measurement, which was 14 days after the chick was born).
Note that the code above has no values_to; the values go into the appropriate one of the new columns that we are using .value to make, the weights into weight and the times into Time.
This layout of the data allows for the possibility that, say, the 10th weight measurement of chick 20 was taken at a different time than the 10th weight measurement of chick 21. (In that case, Time_10 would have different values for chicks 20 and 21). I don’t think that’s actually the case here, however, apart from some missing values, so that the original Time_10 column has a lot of repeats.
Make a spaghetti plot: that is to say, plot weight against time for each chick, joining the points for the same chick by lines, and colouring the points and lines by diet. Hint: follow the model in the lecture notes.
The point of this question is to sort-of convince you that once you have data in the right layout, making a graph or a summary is not too hard. Plus, graphs are pretty. Use the dataframe that I called chick_weights_long, the one that you went to such trouble to make:
ggplot(chick_weights_long, aes(x = Time, y = weight, colour = Diet, group = Chick)) +
  geom_point() + geom_line()
Warning: Removed 22 rows containing missing values or values outside the scale range
(`geom_point()`).
Warning: Removed 22 rows containing missing values or values outside the scale range
(`geom_line()`).
You’ll get a warning about missing values (not all the chicks were weighed at all 12 times). That can safely be ignored.
As for the code, Time is time, on the x-axis (here, you could probably also use rep or its equivalent in your code). weight goes on the y-axis. The diets are going to be indicated by different colours, so colour is Diet. The extra unexpected thing is the group, which is probably new to you. The problem with having colour as Diet is that without the group =, all the points of the same colour will get joined by lines, and that's not what we want. The trick is that group, if we have one, overrides the colour, so that in this case the only points joined by a line are ones that belong to the same chick. (Chick is serving the same purpose as an id variable that uniquely identifies the individual chicks, something that was made for us here rather than our having to make it ourselves.)
All of the chicks are growing over time to some extent (as you would expect). As for the diets:
diet D1 looks bad (most of the red traces end up at the bottom)
diet D2 is all over the place (the green traces are sometimes at the top and sometimes lower down)
diet D3 seems to be good (pale blue traces near the top)
diet D4 is mostly good (purple traces middling to high).
When we come to study data like this in D29, we would expect to see a significant diet effect here, once we take care to allow for the fact that we have multiple measurements on the same chicks. Likewise, there would be a clear time effect (all the chicks are growing). We would also test for an interaction between diet and time, to see whether the time effect depends on diet (the manner of growth depends on the diet, not just that one diet was best all the way along).
Extra: the other graph you could make here is an “interaction plot” of the mean weights by time and diet. To kick this off, use group-by and summarize to work out the mean weights of all the combinations of Diet and Time:
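This is the same summary that starts the plotting pipeline further down:
chick_weights_long %>%
  drop_na(Time) %>%
  group_by(Diet, Time) %>%
  summarize(mean_weight = mean(weight, na.rm = TRUE))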
`summarise()` has grouped output by 'Diet'. You can override using the
`.groups` argument.
We have to be a bit careful about missing values. There are some missing times, which we want nothing to do with, and there are also some missing weights observed at reasonable times, that we want to not count in the means. The first of these is accounted for by dropping rows with missing Time, and the second is accounted for by using na.rm in the calculation of the mean (to remove the missing values before the mean is calculated).
To make an interaction plot, we plot the means against time, joining the points for the same Diet by lines. This also makes use of group =:
chick_weights_long %>%
  drop_na(Time) %>%
  group_by(Diet, Time) %>%
  summarize(mean_weight = mean(weight, na.rm = TRUE)) %>%
  ggplot(aes(x = Time, y = mean_weight, colour = Diet, group = Diet)) +
  geom_point() + geom_line()
`summarise()` has grouped output by 'Diet'. You can override using the
`.groups` argument.
The thing to remember in an interaction plot is that you have both a colour and a group, and they are the same. Note also that this time you are making a graph of the summary (the means), so you pipe the dataframe you made just above straight into ggplot (or save the summary and use that in ggplot, your choice).
If you have used one of these graphs in your second course, you might remember that what you do with it is to judge whether the coloured lines are more or less parallel (no interaction) or not (interaction). These are sort-of parallel in that they go up in more or less the same way, but Diet 3 overtakes Diet 4 on about day 14 and the chicks on Diet 3 are heavier on average from there on. If a suitable test (D29) has a significant interaction between time and diet, it will be because of that inconsistency of pattern.
Interaction plots are easier to interpret than spaghetti plots (only one “trace” for each diet), but sometimes that is an oversimplification: interaction plots show means, but they do not show spread at all. One of the reasons that spaghetti plots are harder to interpret is that they reveal just how much individual observations (chicks) vary from each other, even ones on the same diet, and that is also worth knowing. Looking at the interaction plot, the amount of variability is the thing that tells you whether those four diets are significantly different, or whether there is so much variability that the differences you see between the means are just chance.
American Community Survey
The American Community Survey is a huge sample survey that addresses many aspects of American communities. The data in http://ritsokiguess.site/datafiles/acs4.txt, in aligned columns, contain estimates of the total housed population (that own or rent a place to live), the total number of renters, and the median rent, in two US states. The column called error contains standard errors of the estimates (obtained using methods like the ones in STAC53). The states are identified by name and number, the latter in the column geoid.
Read in and display the data.
There are only six rows and five columns, so you should see it all when you display your dataframe. The .txt on the end of the file should clue you in that there is something non-standard going on here; the question says “aligned columns”, so read_table is what you need, instead of the usual read_csv:
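A minimal sketch (acs is my name for the dataframe, used in the code below):
# aligned columns, so read_table rather than read_csv
my_url <- "http://ritsokiguess.site/datafiles/acs4.txt"
acs <- read_table(my_url)
acs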
Create columns containing the values in estimate for each of the three items in variable. (That is to say, you should get three new columns; the names of those new columns are the items in variable.) This first attempt will probably give you six rows and some missing values, which is fine for the moment (we discuss why there are missing values in the next question).
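A sketch of the first attempt, assuming the dataframe is called acs as above:
acs %>%
  pivot_wider(names_from = variable, values_from = estimate)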
This looks weird, but it is the correct answer for this question. We are about to discuss why it came out this way.
Explain briefly why your output in the previous question came out as it did.
You are probably wondering where those missing values came from (or, why we didn’t get two rows, one for each state). We got the right columns (the values in the estimate column got distributed over the three new columns, which is correct). The problem is the rows. To think about what happened, let’s look back at the original data:
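Assuming, as before, that I read the data into acs:
acs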
What determines the row that each estimate value goes in is the combination of values of all the variables not named in the pivot_wider: in this case, it is geoid, name, and error. The values in the error column are all different, so all six combinations are different, so we still have six rows after the pivot_wider, with missing values where there is no data to go there.
Identify that the problem is in the rows, that what determines the row is the combination of all the other variables not named in the pivot_wider, and that the error values cause the problem. Or, if you like, say that if we didn't have the error column, there would be only two name-geoid combinations, and we'd get the two rows we were expecting.
We as statisticians would call name and geoid “identifier variables” in that either or both of them identify the “experimental units”, states here. Database people would call these two columns “keys”. Keys can be “natural” (like the name of the state), or “surrogate” (something made-up that also identifies the state like geoid). Keys are often used to make sure a left_join matches the right things: in the lecture example of nails at Canadian Tire, each type of nail had its own ID (a so-called “primary key”) so that we could look it up in the catalog to see what it actually was.
Using techniques learned in this course and your insight from the previous question, arrange the data to have three columns of estimate values whose names are the three items in variable, and only two rows, one for each state.
In the previous part, you discovered that the error column was the problem (or that the desired rearrangement has nothing to do with error), so you can safely remove it before doing the pivot_wider:
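Something like this (again with my name acs for the dataframe):
acs %>%
  select(-error) %>%
  pivot_wider(names_from = variable, values_from = estimate)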
and this is now what we wanted. There is a general principle here: when you are about to do a pivot-wider, you may have columns that are not of interest to you, and also play no role in determining what row things should go in. You should remove those columns before you do the pivot-wider.
Applying the rationale of the previous part: there are now only two columns not named in the pivot-wider, geoid and name, and there are only two combinations of those, because each ID goes with only one state.
Alternative approach: you might have recognized earlier that error was going to cause a problem, and removed it first (so that your answers to the two pivot_wider parts are the same). This is fine, as long as you have an explanation somewhere that makes it clear you understand why it is that the error column is the problematic one. “We don’t use error in this question” is not enough because the point is to understand why it is causing (or would cause) pivot_wider to give an unexpected answer.
Extra 1: The data came from here. There, they suggest using id_cols to specify which columns identify rows. This site was one of the first few hits for me, so it is not difficult to search for. However, id_cols is not something we did in lecture, so there was no credit for it here (when this was an assignment problem; minimal credit if you cited your source). If you are interested anyway, it goes like this:
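A sketch:
acs %>%
  pivot_wider(id_cols = c(geoid, name),
              names_from = variable,
              values_from = estimate)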
You can put just one of geoid and name in the id_cols, but then only the one you put there will show up in the result. (You can also say id_cols = -error to achieve the same thing as I just got: “the error column does not identify the individual states”.)
This does the same thing as we did by hand: it removes the column error that contains neither column names nor column values nor row identifiers, and then pivots wider. (The logic is “use only the column(s) named to decide which row each data value goes in”.)
When this was an assignment problem (worth three points): three points if you remove the error column and re-do your pivot-wider. No points if you use id_cols without saying where it came from, and one if you cite your source in a way that can be checked.[3] The lecture material has everything you need to solve this problem. Why are you in this course if you don't want to use it?
Extra 2: if you want to keep the error values and have them go along with the estimate values, you can do something like this:
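A sketch, putting both estimate and error into values_from:
acs %>%
  pivot_wider(names_from = variable,
              values_from = c(estimate, error))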
There are now a lot of columns. pivot_wider has created two sets of value columns. The ones beginning with estimate are the same ones we had before, but there are also columns with names beginning error that are the standard errors for that variable in that state. The pivot_wider now mentions all the other variables apart from the ones that identify states, so we correctly get two rows, one for each state. When there is more than one variable in values_from, pivot_wider glues the name of each original variable onto the front of the names of the new columns, so that you can tell which is which.
If it were not for the fact that the column names already had underscores in them, you would be able to take this dataframe and pivot it longer to get the original dataframe back that you read from the file. To illustrate, separating things now by a colon to keep things straight:
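A sketch (acs_wide is a made-up name for the result):
acs %>%
  pivot_wider(names_from = variable,
              values_from = c(estimate, error),
              names_sep = ":") -> acs_wide
acs_wide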
The values in the original column variable are separated by underscores; this time I used a colon to separate the original variables estimate and error from what they are estimates and errors of. So now I should be able to pivot_longer, allowing for the fact that the column names encode two things, one of which should make new column names:
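A sketch, starting from the acs_wide above:
acs_wide %>%
  pivot_longer(contains(":"),
               names_to = c(".value", "variable"),
               names_sep = ":")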
and we are indeed back where we started: this funky[4] pivot_longer is the inverse of the funky pivot_wider that preceded it.
Footnotes
1. This seems weird, but it is actually the right thing to do, if you stop and think about it.
2. If you remember the problem with houses that had 3 or 4 bedrooms: we had to turn the number of bedrooms into a factor to get a boxplot to work.
3. ChatGPT and the like, you are reminded, are not reliable sources.
4. See this dictionary: the original meaning was "smelly", but then it came to be applied to music that resembled African-American blues or gospel, particularly in rhythm, and then to anything unconventional, which is my meaning here.