STAC32 Assignment 8

You are expected to complete this assignment on your own: that is, you may discuss general ideas with others, but the writeup of the work must be entirely your own. If your assignment is unreasonably similar to that of another student, you can expect to be asked to explain yourself.

If you run into problems on this assignment, it is up to you to figure out what to do. The only exception is if it is impossible for you to complete this assignment, for example a data file cannot be read. (There is a difference between you not knowing how to do something, which you have to figure out, and something being impossible, which you are allowed to contact me about.)

You must hand in a rendered document that shows your code, the output that the code produces, and your answers to the questions. This should be a file with .html on the end of its name. There is no credit for handing in your unrendered document (ending in .qmd), because the grader cannot then see whether the code in it runs properly. After you have handed in your file, you should be able to see (in Attempts) what file you handed in, and you should make a habit of checking that you did indeed hand in what you intended to, and that it displays as you expect.

Hint: render your document frequently, and solve any problems as they come up, rather than trying to do so at the end (when you may be close to the due date). If your document will not successfully render, it is because of an error in your code that you will have to find and fix. The error message will tell you where the problem is, but it is up to you to sort out what the problem is.

1 Bread

What makes bread rise? Specifically, what are the effects of baking temperature and the amount of yeast on how much a loaf of bread will rise while baking? To find out, a batch of a certain bread mix was divided into 48 parts. Each part had a randomly chosen amount of yeast added (0.75, 1, or 1.25 teaspoons) and was then baked at a temperature of either 350 or 425 (degrees Fahrenheit). After baking, the height of each (very small) loaf of bread was measured (in inches). Apart from the yeast and the baking temperature, the ingredients for each small loaf were identical, so any differences in height can be attributed to one or both of the amount of yeast used and the baking temperature.

The data are in http://ritsokiguess.site/datafiles/bread_wide.csv.

(a) (1 point) Read in and display (most of) the data.

As usual:

library(tidyverse)
my_url <- "http://ritsokiguess.site/datafiles/bread_wide.csv"
bread <- read_csv(my_url)
Rows: 8 Columns: 7
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
dbl (7): row, yeast0.75_temp350, yeast0.75_temp425, yeast1_temp350, yeast1_t...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
bread

There are only 8 rows, so you should see all of them and most of the columns. The column names are rather long, so you may not see all the columns.

(b) (3 points) The values you read in were originally stored in a spreadsheet with two rows of headers, one showing the amount of yeast and the other the baking temperature. (The headers were combined into one row for you.) Rearrange the data so that there is one column of heights, and columns showing the amount of yeast and the temperature that goes with each height. For maximum points, do your rearrangement with one command. Save your resulting dataframe.

This is one of the variations of pivot_longer, in particular the one with two names_to, because you want a column of yeast amounts and a column of temperatures. The two parts of the (current) column names are separated by an underscore, so:

bread %>% pivot_longer(-row, names_to = c("yeast", "temperature"),
                    names_sep = "_",
                    values_to = "height") -> bread_long
bread_long

Three points for that. -row is the best way to specify which columns to pivot-longer (“everything except for row”); the other column names are rather long, but using starts_with("yeast") or similar is also reasonable.

If you didn’t manage that, do it in two steps: an ordinary pivot-longer:

bread %>% pivot_longer(-row, names_to = "yt", values_to = "height")

(you may have trouble coming up with a name for the column I called yt), and then use separate_wider_delim:

bread %>% pivot_longer(-row, names_to = "yt", values_to = "height") %>% 
  separate_wider_delim(yt, "_", names = c("yeast", "temperature"))

Two points for doing it this way. If you do, the last code chunk (with the two commands in it) suffices for your answer. While you’re figuring out what to do, you should probably do the pivot-longer first and look at the result to see what to do next, but it’s fine to hand in just the final chunk.
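If you want to convince yourself that the one-step and two-step versions really do the same thing, here is a minimal sketch. The dataframe toy is a made-up miniature of the wide layout (the name and the heights in it are invented for illustration, not taken from the real data file):

```r
library(tidyverse)

# Hypothetical miniature of the wide layout: made-up heights, two rows only
toy <- tibble(
  row = c(1, 2),
  `yeast0.75_temp350` = c(2.0, 2.5),
  `yeast0.75_temp425` = c(1.5, 1.0),
  `yeast1_temp350` = c(3.0, 3.5)
)

# One step: two names_to, split at the underscore
one_step <- toy %>%
  pivot_longer(-row, names_to = c("yeast", "temperature"),
               names_sep = "_",
               values_to = "height")

# Two steps: ordinary pivot-longer, then split the combined column
two_step <- toy %>%
  pivot_longer(-row, names_to = "yt", values_to = "height") %>%
  separate_wider_delim(yt, "_", names = c("yeast", "temperature"))

# Compare the two results
all.equal(one_step, two_step)
```

Both versions have one row per height (here 2 rows times 3 columns of heights, so 6 rows) and the same four columns.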

(c) (2 points) Make a suitable graph of the three columns (not including row) in your final dataframe.

These columns are yeast, temperature, and height (as I called them; you can use your own names), two categorical and one quantitative, so a grouped boxplot is called for. To make that, choose one of your categorical variables to be x and the other to be fill (or colour); the quantitative variable is y:

ggplot(bread_long, aes(x = yeast, y = height, fill = temperature)) + geom_boxplot()

There are three different values of yeast and only two of temperature, so I put yeast on the \(x\)-axis. It seems better to me to make the best use of the “real-estate” on the \(x\)-axis: there is lots of room for categories there, but having more than a few colours is difficult to sort out.

Having said that, for this question (with only three and two categories), I have no objection if you have three colours:

ggplot(bread_long, aes(fill = yeast, y = height, x = temperature)) + geom_boxplot()

This graph was not so hard to draw once you had the data laid out properly. (This is often the case: most of the work is tidying the data.) I don’t think you could draw the graph at all with the data laid out as it originally was.

Extra: assuming that a bigger height is better, it is best to use more yeast and a lower temperature. The layout of the boxplots suggests that a lower temperature is better for any amount of yeast, and more yeast is better at any temperature. That is to say, there is no evidence of an interaction between temperature and yeast (if you think of this as an example of a two-way ANOVA). Further evidence of this is that the three sets of two boxplots (or the two sets of three boxplots, depending on which way you drew it) are more or less parallel to each other, indicating that the effect of one variable does not depend on the level of the other one.
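A numerical way to look at the same idea is to compare group means, sketched below. The dataframe toy_long and all the heights in it are invented purely to show the mechanics (they are not the bread data); with your own dataframe you would use bread_long instead. If the difference between the two temperatures is about the same for each amount of yeast, that is what “no interaction” looks like in numbers:

```r
library(tidyverse)

# Hypothetical long-format data: made-up heights, two loaves per combination
toy_long <- tribble(
  ~yeast,      ~temperature, ~height,
  "yeast0.75", "temp350",    3,
  "yeast0.75", "temp350",    4,
  "yeast0.75", "temp425",    2,
  "yeast0.75", "temp425",    3,
  "yeast1",    "temp350",    4,
  "yeast1",    "temp350",    5,
  "yeast1",    "temp425",    3,
  "yeast1",    "temp425",    4,
  "yeast1.25", "temp350",    5,
  "yeast1.25", "temp350",    6,
  "yeast1.25", "temp425",    4,
  "yeast1.25", "temp425",    5
)

# Mean height for each yeast-temperature combination
means <- toy_long %>%
  group_by(yeast, temperature) %>%
  summarize(mean_height = mean(height), .groups = "drop")

# One row per yeast amount, with the temperature effect as a difference
diffs <- means %>%
  pivot_wider(names_from = temperature, values_from = mean_height) %>%
  mutate(difference = temp350 - temp425)
diffs
```

In these made-up numbers the difference is the same for every amount of yeast, which is the “parallel” pattern described above. With real data the differences will not be exactly equal, and judging whether they are close enough is what the interaction test in a two-way ANOVA is for.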

2 American Community Survey

The American Community Survey is a huge sample survey that addresses many aspects of American communities. The data in http://ritsokiguess.site/datafiles/acs4.txt, in aligned columns, contain estimates of the total housed population (that own or rent a place to live), the total number of renters, and the median rent, in two US states. The column called error contains standard errors of the estimates (obtained using methods like the ones in STAC53). The states are identified by name and number, the latter in the column geoid.

(a) (1 point) Read in and display the data.

There are only six rows and five columns, so you should see it all when you display your dataframe. The .txt on the end of the file should clue you in that there is something non-standard going on here; the question says “aligned columns”, so read_table is what you need, instead of the usual read_csv:

library(tidyverse)
my_url <- "http://ritsokiguess.site/datafiles/acs4.txt"
acs <- read_table(my_url)

── Column specification ────────────────────────────────────────────────────────
cols(
  geoid = col_character(),
  name = col_character(),
  variable = col_character(),
  estimate = col_double(),
  error = col_double()
)
acs

That looks good.
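If you are wondering what difference the choice of function makes, here is a small sketch using literal text instead of a file. The text in aligned_text is invented (hypothetical state names and numbers), and wrapping it in I() is just readr's way of saying “this is data, not a filename”:

```r
library(tidyverse)

# Hypothetical aligned-column text: values separated by runs of spaces
aligned_text <- I("geoid name    estimate
01    StateA  10
02    StateB  20
")

# read_table splits on whitespace, so we get three columns
as_table <- read_table(
  aligned_text,
  col_types = cols(
    geoid = col_character(),
    name = col_character(),
    estimate = col_double()
  )
)

# read_csv looks for commas, finds none, and puts everything in one column
as_csv <- read_csv(aligned_text, show_col_types = FALSE)

ncol(as_table)
ncol(as_csv)
```

A single column with a long, strange name is the usual sign that you used read_csv on something that is not comma-separated.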

(b) (2 points) Create columns containing the values in estimate for each of the three items in variable. (That is to say, you should get three new columns; the names of those new columns are the items in variable.) This first attempt will probably give you six rows (we discuss why in the next part).

Run pivot_wider exactly as you would guess:

acs %>% pivot_wider(names_from = variable, values_from = estimate)

This looks weird, but it is the correct answer for this part. We are about to discuss why it came out this way.

(c) (2 points) Explain briefly why your output in the previous part came out as it did.

You are probably wondering where those missing values came from (or, why we didn’t get two rows, one for each state). We got the right columns (the values in the estimate column got distributed over the three new columns, which is correct). The problem is the rows. To think about what happened, let’s look back at the original data:

acs

and the code that we ran above:

acs %>% pivot_wider(names_from = variable, values_from = estimate)

What determines the row that each estimate value goes in is the combination of values of all the variables not named in the pivot_wider: in this case, it is geoid, name, and error. The values in the error column are all different, so all six combinations are different, so we still have six rows after the pivot_wider, with missing values where there is no data to go there.

For full credit, identify that the problem is in the rows, that what determines the row is the combination of all the other variables not named in the pivot_wider, and that the error values cause the problem. Or, if you like, say that if we didn’t have the error column, there would be only two name-geoid combinations, and we’d get the two rows we were expecting.
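One way to check this reasoning is to count the combinations directly. The sketch below uses a made-up stand-in called toy_acs (hypothetical state names, variable names, and numbers, laid out like the real data with six rows and five columns):

```r
library(tidyverse)

# Hypothetical stand-in for the real data: all values invented
toy_acs <- tribble(
  ~geoid, ~name,    ~variable, ~estimate, ~error,
  "01",   "StateA", "pop",     1000,      11,
  "01",   "StateA", "renters", 300,       12,
  "01",   "StateA", "rent",    700,       13,
  "02",   "StateB", "pop",     2000,      14,
  "02",   "StateB", "renters", 900,       15,
  "02",   "StateB", "rent",    1200,      16
)

# Combinations of everything not named in the pivot_wider: one per row
n_with_error <- toy_acs %>% distinct(geoid, name, error) %>% nrow()

# Without error, only the two states are left
n_without_error <- toy_acs %>% distinct(geoid, name) %>% nrow()

# The pivot_wider has as many rows as there are combinations
wide <- toy_acs %>%
  pivot_wider(names_from = variable, values_from = estimate)

n_with_error     # 6
n_without_error  # 2
nrow(wide)       # 6
```

Each of the six rows of wide has only one of the three new columns filled in, and the other two are missing.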

(d) (3 points) Using techniques learned in this course and your insight from the previous part, arrange the data to have three columns of estimate values whose names are the three items in variable, and only two rows, one for each state.

In the previous part, you discovered that the error column was the problem (or that the desired rearrangement has nothing to do with error), so you can safely remove it before doing the pivot_wider:

acs %>% 
  select(-error) %>% 
  pivot_wider(names_from = variable, values_from = estimate)

and this is now what we wanted. There is a general principle here: when you are about to do a pivot-wider, you may have columns that are not of interest to you, and also play no role in determining what row things go in. You should remove those columns before you do the pivot-wider.

Applying the rationale of the previous part: there are now only two columns not named in the pivot-wider, geoid and name, and there are only two combinations of those, because each ID goes with only one state.

Alternative approach: you might have recognized earlier that error was going to cause a problem, and removed it first (so that your answer to (b) is the same as my answer to (d)). This is fine, as long as you have an explanation somewhere equivalent to my (c) that makes it clear you understand why it is that the error column is the problematic one. “We don’t use error in this question” is not enough because the point is to understand why it is causing (or would cause) pivot_wider to give an unexpected answer.

Extra 1: The data came from here. There, they suggest using id_cols to specify which columns identify rows. This site was one of the first few hits for me, so it is not difficult to search for. However, id_cols is not something we did in lecture, so there’s no credit for it here (or minimal credit if you cite your source). If you are interested anyway, it goes like this:

acs %>% 
  pivot_wider(names_from = variable, values_from = estimate, id_cols = c(geoid, name))

You can put just one of geoid and name in the id_cols, but then only the one you put there will show up in the result. (You can also say id_cols = -error to achieve the same thing as I just got: “the error column does not identify the individual states”.)

This does the same thing as we did by hand: it removes the column error that contains neither column names nor column values nor row identifiers, and then pivots wider. (The logic is “use only the column(s) named to decide which row each data value goes in”.)
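To see that the two approaches agree, you can run both on the same dataframe and compare. This is a sketch on a made-up stand-in called toy_acs (hypothetical names and numbers, not the real survey data):

```r
library(tidyverse)

# Hypothetical stand-in for the real data: all values invented
toy_acs <- tribble(
  ~geoid, ~name,    ~variable, ~estimate, ~error,
  "01",   "StateA", "pop",     1000,      11,
  "01",   "StateA", "renters", 300,       12,
  "01",   "StateA", "rent",    700,       13,
  "02",   "StateB", "pop",     2000,      14,
  "02",   "StateB", "renters", 900,       15,
  "02",   "StateB", "rent",    1200,      16
)

# By hand: drop error, then pivot wider
by_hand <- toy_acs %>%
  select(-error) %>%
  pivot_wider(names_from = variable, values_from = estimate)

# With id_cols: say which columns identify the rows
with_id_cols <- toy_acs %>%
  pivot_wider(names_from = variable, values_from = estimate,
              id_cols = c(geoid, name))

all.equal(by_hand, with_id_cols)
```

Either way there are two rows, one per state, with no missing values.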

Three points if you remove the error column and re-do your pivot-wider. No points if you use id_cols without saying where it came from, and one if you cite your source in a way that can be checked.

Extra 2: if you want to keep the error values and have them go along with the estimate values, you can do something like this:

acs %>% 
  pivot_wider(names_from = variable, values_from = c(estimate, error))

There are now a lot of columns. pivot_wider has created two sets of value columns. The ones beginning with estimate are the same ones we had before, but there are also columns with names beginning with error that are the standard errors for that variable in that state. The pivot_wider now mentions all the other variables apart from the ones that identify states, so we correctly get two rows, one for each state. When there is more than one variable in values_from, pivot_wider glues the name of each variable onto the front of the names of the new columns, so that you can tell which is which.

If it were not for the fact that the column names already had underscores in them, you would be able to take this dataframe and pivot it longer, splitting the column names at the underscore as in the Bread question, and get back the original dataframe that you read from the file.
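Here is a sketch of that round trip on a made-up stand-in, toy_acs, whose (hypothetical) variable names have no underscores in them. One detail that goes beyond the Bread question: to get separate estimate and error columns back, rather than one column of values, I use the special name ".value" in names_to, which is a pivot_longer feature not otherwise used in this assignment:

```r
library(tidyverse)

# Hypothetical stand-in: invented values, variable names without underscores
toy_acs <- tribble(
  ~geoid, ~name,    ~variable, ~estimate, ~error,
  "01",   "StateA", "pop",     1000,      11,
  "01",   "StateA", "renters", 300,       12,
  "01",   "StateA", "rent",    700,       13,
  "02",   "StateB", "pop",     2000,      14,
  "02",   "StateB", "renters", 900,       15,
  "02",   "StateB", "rent",    1200,      16
)

# Wider, keeping both estimates and errors (columns like estimate_pop, error_pop)
wide_both <- toy_acs %>%
  pivot_wider(names_from = variable, values_from = c(estimate, error))

# Longer again: the part before the underscore becomes a column name (.value),
# the part after it goes into variable
back <- wide_both %>%
  pivot_longer(-c(geoid, name),
               names_to = c(".value", "variable"),
               names_sep = "_")

# Compare after sorting, so that row order does not matter
all.equal(arrange(back, geoid, variable), arrange(toy_acs, geoid, variable))
```

With the real data, a variable name containing an underscore would be split in more than one place, which is why this does not work there as it stands.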