Worksheet 9

Published

October 30, 2024

Questions are below. My solutions will be available after the tutorials are all finished. The whole point of these worksheets is for you to use your lecture notes to figure out what to do. In tutorial, the TAs are available to guide you if you get stuck. Once you have figured out how to do this worksheet, you will be prepared to tackle the assignment that depends on it.

If you are not able to finish in an hour, I encourage you to continue later with what you were unable to finish in tutorial.

Packages

library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

American Community Survey

The American Community Survey is a huge sample survey that addresses many aspects of American communities. The data in http://ritsokiguess.site/datafiles/acs4.txt, in aligned columns, contain estimates of the total housed population (people who own or rent a place to live), the total number of renters, and the median rent, in two US states. The column called error contains standard errors of the estimates (obtained using methods like the ones in STAC53). The states are identified by name and number, the latter in the column geoid.

  1. Read in and display the data.

There are only six rows and five columns, so you should see it all when you display your dataframe. The .txt on the end of the file should clue you in that there is something non-standard going on here; the question says “aligned columns”, so read_table is what you need, instead of the usual read_csv:

library(tidyverse)
my_url <- "http://ritsokiguess.site/datafiles/acs4.txt"
acs <- read_table(my_url)

── Column specification ────────────────────────────────────────────────────────
cols(
  geoid = col_character(),
  name = col_character(),
  variable = col_character(),
  estimate = col_double(),
  error = col_double()
)
acs

That looks good.

  2. Create columns containing the values in estimate for each of the three items in variable. (That is to say, you should get three new columns; the names of those new columns are the items in variable.) This first attempt will probably give you six rows and some missing values (we discuss why in the next part).

Run pivot_wider exactly as you would guess:

acs %>% pivot_wider(names_from = variable, values_from = estimate)

This looks weird, but it is the correct answer for this part. We are about to discuss why it came out this way.

  3. Explain briefly why your output in the previous part came out as it did.

You are probably wondering where those missing values came from (or, why we didn’t get two rows, one for each state). We got the right columns (the values in the estimate column got distributed over the three new columns, which is correct). The problem is the rows. To think about what happened, let’s look back at the original data:

acs

and the code that we ran above:

acs %>% pivot_wider(names_from = variable, values_from = estimate)

What determines the row that each estimate value goes in is the combination of values of all the variables not named in the pivot_wider: in this case, it is geoid, name, and error. The values in the error column are all different, so all six combinations are different, so we still have six rows after the pivot_wider, with missing values where there is no data to go there.

Make it clear that the problem is in the rows, that what determines the row is the combination of all the other variables not named in the pivot_wider, and that the error values are what cause the problem. Or, if you like, say that if we didn’t have the error column, there would be only two name-geoid combinations, and we’d get the two rows we were expecting.
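If you want to see this principle in isolation, here is a tiny made-up example (not the worksheet data). The noise column plays the same role as error: it differs within each id, so pivot_wider keeps one row per id-noise combination instead of one row per id, and dropping noise first gives the two rows you would expect:

# made-up toy data, just to illustrate the row-determination principle
toy <- tribble(
  ~id, ~noise, ~item, ~value,
  "A", 0.1, "x", 1,
  "A", 0.2, "y", 2,
  "B", 0.3, "x", 3,
  "B", 0.4, "y", 4
)
# four rows with missing values, because noise differs within each id
toy %>% pivot_wider(names_from = item, values_from = value)
# two rows, as expected, once noise is removed
toy %>% select(-noise) %>% pivot_wider(names_from = item, values_from = value)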

We as statisticians would call name and geoid “identifier variables” in that either or both of them identify the “experimental units”, states here. Database people would call these two columns “keys”. Keys can be “natural” (like the name of the state), or “surrogate” (something made-up that also identifies the state like geoid). Keys are often used to make sure a left_join matches the right things: in the lecture example of nails at Canadian Tire, each type of nail had its own ID (a so-called “primary key”) so that we could look it up in the catalog to see what it actually was.
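To make the key idea concrete, here is a tiny made-up sketch (not the actual lecture data): nail_id acts as a primary key, so a left_join matches each sale to the right catalogue entry:

# made-up catalogue and sales tables; nail_id is the (primary) key
catalogue <- tribble(
  ~nail_id, ~description,
  "N01", "2-inch finishing nail",
  "N02", "3-inch framing nail"
)
sales <- tribble(
  ~nail_id, ~quantity,
  "N01", 500,
  "N02", 200
)
# each sale picks up the description for its nail_id
sales %>% left_join(catalogue, by = "nail_id")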

  4. Using techniques learned in this course and your insight from the previous part, arrange the data to have three columns of estimate values whose names are the three items in variable, and only two rows, one for each state.

In the previous part, you discovered that the error column was the problem (or that the desired rearrangement has nothing to do with error), so you can safely remove it before doing the pivot_wider:

acs %>% 
  select(-error) %>% 
  pivot_wider(names_from = variable, values_from = estimate)

and this is now what we wanted. There is a general principle here: when you are about to do a pivot-wider, you may have columns that are not of interest to you, and also play no role in determining what row things should go in. You should remove those columns before you do the pivot-wider.

Applying the rationale of the previous part: there are now only two columns not named in the pivot-wider, geoid and name, and there are only two combinations of those, because each ID goes with only one state.

Alternative approach: you might have recognized earlier that error was going to cause a problem, and removed it first (so that your answers to the two pivot_wider parts are the same). This is fine, as long as you have an explanation somewhere that makes it clear you understand why it is that the error column is the problematic one. “We don’t use error in this question” is not enough because the point is to understand why it is causing (or would cause) pivot_wider to give an unexpected answer.

Extra 1: The data came from here. There, they suggest using id_cols to specify which columns identify rows. This site was one of the first few hits for me, so it is not difficult to search for. However, id_cols is not something we did in lecture, so there was little credit for it here (when this was an assignment problem): minimal credit, and only if you cited your source. If you are interested anyway, it goes like this:

acs %>% 
  pivot_wider(names_from = variable, values_from = estimate, id_cols = c(geoid, name))

You can put just one of geoid and name in the id_cols, but then only the one you put there will show up in the result. (You can also say id_cols = -error to achieve the same thing as I just got: “the error column does not identify the individual states”.)
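Written out, that id_cols = -error variant looks like this:

acs %>% 
  pivot_wider(names_from = variable, values_from = estimate, id_cols = -error)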

This does the same thing as we did by hand: it removes the column error that contains neither column names nor column values nor row identifiers, and then pivots wider. (The logic is “use only the column(s) named to decide which row each data value goes in”.)

When this was an assignment problem (worth three points): three points if you remove the error column and re-do your pivot-wider. No points if you use id_cols without saying where it came from, and one if you cite your source in a way that can be checked. The lecture material has everything you need to solve this problem. Why are you in this course if you don’t want to use it?

Extra 2: if you want to keep the error values and have them go along with the estimate values, you can do something like this:

acs %>% 
  pivot_wider(names_from = variable, values_from = c(estimate, error))

There are now a lot of columns. pivot_wider has created two sets of value columns. The ones beginning with estimate are the same ones we had before, but there are also columns with names beginning error that are the standard errors for that variable in that state. The pivot_wider now mentions all the other variables apart from the ones that identify states, so we correctly get two rows, one for each state. When there is more than one variable in values_from, pivot_wider glues the name of each original variable onto the front of the names of the new columns, so that you can tell which is which.

If it were not for the fact that the column names already had underscores in them, you would be able to take this dataframe and pivot it longer by the method of the other question and get the original dataframe back that you read from the file. To illustrate:

acs %>% 
  pivot_wider(names_from = variable, values_from = c(estimate, error), names_sep = ":")

The default separator is an underscore, but the values in the original variable column already contain underscores; this time I used a colon to separate the original variables estimate and error from what they are estimates and errors of. So now I should be able to pivot_longer, allowing for the fact that the column names encode two things, one of which should make new column names:

acs %>% 
  pivot_wider(names_from = variable, values_from = c(estimate, error), names_sep = ":") %>% 
  pivot_longer(starts_with("e"), names_to = c(".value", "variable"), names_sep = ":")

and we are indeed back where we started: this funky1 pivot_longer is the inverse of the funky pivot_wider that preceded it.

The boiling point of water

The boiling point of water is commonly known to be 100 degrees C (212 degrees F). But the actual boiling point of water depends on atmospheric pressure; when the pressure is lower, the boiling point is also lower. For example, at higher altitudes, the atmospheric pressure is lower because the air is thinner, so that in Denver, Colorado, which is 1600 metres above sea level, water boils at around 95 degrees C. Source.

Some data were collected on the atmospheric pressure at seventeen locations (pressure in the data file, in inches of mercury) and the boiling temperature of water at those locations (boiling, in degrees F). This is (evidently) American data. The data are in http://ritsokiguess.site/datafiles/boiling-point.csv. Our aim is to predict boiling point from atmospheric pressure.

  1. Read in and display (some of) the data.

Solution

This is a .csv file, so no great difficulty:

my_url <- "http://ritsokiguess.site/datafiles/boiling-point.csv"
boiling_point <- read_csv(my_url)
Rows: 17 Columns: 2
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
dbl (2): boiling, pressure

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
boiling_point

There are correctly 17 rows of data with two columns boiling and pressure as promised.

\(\blacksquare\)

  2. Draw a suitable plot of these data.

Solution

There are two quantitative variables, so a scatterplot is the thing. We are trying to predict boiling point, so that goes on the \(y\)-axis:

ggplot(boiling_point, aes(x = pressure, y = boiling)) + geom_point() + geom_smooth()
`geom_smooth()` using method = 'loess' and formula = 'y ~ x'

Adding a smooth trend will help us to get a sense of what kind of relationship there is. This is better than adding a regression line to this plot, because we don’t know whether the relationship is straight.

Comment on this one below.
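For comparison, if you did want to put a straight line on the plot instead of the smooth trend, this variant (using method = "lm" in the geom_smooth) would draw it; whether a line is appropriate is exactly what we are about to assess:

ggplot(boiling_point, aes(x = pressure, y = boiling)) + 
  geom_point() + 
  geom_smooth(method = "lm")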

\(\blacksquare\)

  3. Comment briefly on your plot and any observations that appear not to belong to the pattern shown by the rest of the observations.

Solution

There is a clear upward trend, one that appears to be linear, apart from the two observations on the left that have a higher boiling point than you would expect for such a low pressure.

Do not, at this point, suggest removing these points. If they are genuine, they have to be included in our model. Removing them for no reason (other than “they don’t fit a line”) is scientific dishonesty, because you want a model that will generalize to future observations, including ones like those on the left. If you want to make a suggestion, suggest checking whether those points on the left are errors, and if they are errors, removing them. That is the only justification for removing data points: in that case, they are not like the others, for a reason you can explain.

\(\blacksquare\)

  4. Fit a suitable linear regression, and display the results. (Do this even if you think this is not appropriate.)

Solution

You probably think that those points over on the left mean that we should not fit a regression at all (which is the reason for my parenthetical comment in the question), but fit the regression here anyway. We will critique it shortly.

boiling_point.1 <- lm(boiling ~ pressure, data = boiling_point)
summary(boiling_point.1)

Call:
lm(formula = boiling ~ pressure, data = boiling_point)

Residuals:
     Min       1Q   Median       3Q      Max 
-1.41483 -0.91550 -0.05148  0.76941  2.72840 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 163.9307     2.6551   61.74  < 2e-16 ***
pressure      1.5796     0.1051   15.04 1.88e-10 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.291 on 15 degrees of freedom
Multiple R-squared:  0.9378,    Adjusted R-squared:  0.9336 
F-statistic:   226 on 1 and 15 DF,  p-value: 1.879e-10

\(\blacksquare\)

  5. Comment briefly on whether the slope makes sense, and on the overall fit of the model.

Solution

There are actually three things I would like to see:

  • the slope is positive, as predicted in the question (a lower pressure goes with a lower boiling point, and thus higher with higher also).
  • the slope is (very strongly) significant, meaning that the positive slope here is far from just chance.
  • the R-squared is 94%, meaning that the regression fits well, even with the two points on the left of the scatterplot included.

\(\blacksquare\)

  6. Make two suitable plots that can be used to assess the appropriateness of this regression.

Solution

The two plots in question are of the residuals vs. fitted values, and a normal quantile plot of the residuals. Don’t forget to use your fitted model object (my boiling_point.1) as the “dataframe” to feed into ggplot:

ggplot(boiling_point.1, aes(x = .fitted, y = .resid)) + geom_point()

and

ggplot(boiling_point.1, aes(sample = .resid)) + stat_qq() + stat_qq_line()

\(\blacksquare\)

  7. When you looked at your scatterplot, you may have identified some observations that did not follow the pattern of the others. Describe briefly how these observations show up on the two plots you just drew.

Solution

On the scatterplot, we saw two points that were off a linear trend on the left. These had low pressure and a higher boiling point than expected given that low pressure. These will show up as very positive residuals:

  • on the plot of residuals vs. fitted values, they are up in the top left corner
  • on the normal quantile plot of the residuals, they are the two most positive residuals, at the top right.

Try to be as brief as this (you may need to think about which are the most relevant things to say).

\(\blacksquare\)

  8. It turns out that the two observations with the lowest pressure are errors. Create a new dataframe with these observations removed, and repeat the regression. (You do not need to make the residual plots.)

Solution

Find a way to select the two observations with the lowest pressure. The easiest way is to note that their pressure values, whatever they are, must be less than about 21.25 (the major gridlines on my scatterplot are 2.5 apart, so the minor gridlines are 1.25 apart, and so the vertical gridline in the graph background is 1.25 less than 22.5, i.e. at 21.25). That would give this:

boiling_point %>% 
  filter(pressure > 21.25) -> boiling_point_no_lo
boiling_point_no_lo

Correctly 15 rows left.

You could use anything between 21 and 22 as your dividing line.

Another way is to sort on pressure and then remove the two lowest ones:

boiling_point %>% arrange(pressure) %>% 
  slice(-(1:2))

slice with negative input says “everything but those rows”. Alternatively, you could feed slice rows 3 through 17 of the original dataframe (written out below).

The data values are already sorted by pressure, in fact, so if you don’t do the sorting yourself you need to look and see that the two values you want to remove actually are the first two.
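Written out, that “rows 3 through 17” variant looks like this (it relies on the data already being in pressure order):

# keep rows 3 through 17, ie. drop the two lowest-pressure rows,
# relying on the data already being sorted by pressure
boiling_point %>% 
  slice(3:17)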

A third way is to use slice_max. It combines the slice with the sorting, so that you can do it all in one step:

boiling_point %>% 
  slice_max(pressure, n = 15)

The n = 15 is to say that you want to keep the \(17-2 = 15\) rows that have the largest pressure values, which is the same thing as throwing away the two rows with the smallest ones.

I had a hard time naming this dataset. I didn’t want to use numbers because I am using those for models.

All right, the regression:

boiling_point.2 <- lm(boiling ~ pressure, data = boiling_point_no_lo)
summary(boiling_point.2)

Call:
lm(formula = boiling ~ pressure, data = boiling_point_no_lo)

Residuals:
     Min       1Q   Median       3Q      Max 
-1.16358 -0.22154  0.01851  0.21714  0.76407 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 157.54528    1.24593  126.45  < 2e-16 ***
pressure      1.81477    0.04827   37.59 1.19e-14 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.5041 on 13 degrees of freedom
Multiple R-squared:  0.9909,    Adjusted R-squared:  0.9902 
F-statistic:  1413 on 1 and 13 DF,  p-value: 1.193e-14

\(\blacksquare\)

  9. Compare the slope and the R-squared from this regression with the values you got in the first regression. Why is it sensible that the values differ in the way they do?

Solution

I’m going to make a little Markdown table out of the values, for ease of comparison:2

Regression   | Slope | R-squared
-------------|-------|----------
Regression 1 | 1.58  | 0.938
Regression 2 | 1.815 | 0.991

The second regression has both a more positive slope and a bigger R-squared than the first one.

The slope is bigger (more positive) after we remove the two observations because they had a higher than expected boiling point, and that pulled the line up on the left compared to the regression without those two observations. Thus the first line is less steep than the second, meaning that it has a less positive slope.

The R-squared is bigger because all the points, apart from the two we removed, are very close to a straight-line trend, and so the second regression should be, and is, very close to a perfect fit.
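As a sketch (not asked for in the question), you could draw both fitted lines on the scatterplot of all seventeen points to see this: red for the first regression, blue for the second. The intercepts and slopes come out of the fitted models with coef:

# both fitted lines on the full scatterplot: red = all data, blue = errors removed
ggplot(boiling_point, aes(x = pressure, y = boiling)) +
  geom_point() +
  geom_abline(intercept = coef(boiling_point.1)[1],
              slope = coef(boiling_point.1)[2], colour = "red") +
  geom_abline(intercept = coef(boiling_point.2)[1],
              slope = coef(boiling_point.2)[2], colour = "blue")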

Extra: when you make a table like this, you should really pull the values out of the R output and not type them again. (What if you change the data, and then your numbers will be wrong? What if you realize you made a mistake earlier?)

Here are the ingredients I used, followed by my “recipe”:

  • $ for pulling things out of other things
  • tidy from the broom package for extracting intercept and slope
  • glance from broom for extracting R-squared (I probably didn’t need to use broom but it’s easier this way)
  • round to round a result to a fixed number of decimals
  • list to make a structure to store my results in and get them out of
  • creating a variable to hold my number of decimal places, in case I change my mind after seeing the results

and my code:

library(broom)
decimals <- 3
ans <- list(
  r1 = list(
    slope = round(tidy(boiling_point.1)$estimate[2], decimals),
    rsq = round(glance(boiling_point.1)$r.squared, decimals)
  ),
  r2 = list(
    slope = round(tidy(boiling_point.2)$estimate[2], decimals),
    rsq = round(glance(boiling_point.2)$r.squared, decimals)
  )
)
ans
$r1
$r1$slope
[1] 1.58

$r1$rsq
[1] 0.938


$r2
$r2$slope
[1] 1.815

$r2$rsq
[1] 0.991

Steps:

  • load broom
  • set my preferred number of decimal places
  • create a list called ans to hold all the results. Lists are a very flexible data structure in R; you can basically put anything in an R list, even another list, which we are going to do shortly.
  • inside ans, create a list r1 that will hold the results we want from the first regression.
  • r1 contains a thing called slope that will contain the slope from the first regression:
    • feed the first model into tidy
    • extract the thing called estimate, which contains the intercept and slope in that order
    • extract the second one of those, which is the slope
    • round it to your pre-specified number of decimal places
  • r1 also contains a thing called rsq which contains the R-squared for the first regression:
    • feed the model into glance
    • extract the thing called r.squared
    • round it to the right number of decimals
  • repeat the whole thing for the second regression by creating a list r2 with the same things in it (copy, paste, edit)

This seems very complicated, but (i) after you’ve done it once, you get the hang of it, and (ii) the payoff comes when we make the table because we calculated exactly what we need. The rounding is always the most annoying part.

There are several ways to make tables in Quarto. I like the one below:

Regression      |  Slope                |    R-squared
----------------|-----------------------|------------------------
Regression 1    | `{r} ans$r1$slope`  |    `{r} ans$r1$rsq`
Regression 2    | `{r} ans$r2$slope`  |    `{r} ans$r2$rsq`

Columns are separated by |, which is the Linux “pipe” symbol.3 This means that columns don’t have to line up. The long row of - draws a horizontal rule between the first row (formatted as headings) and the rest of the table. To insert the values we went to such great trouble to obtain, we use “inline R code”. To do this, a backtick followed by {r} followed by some code followed by another backtick displays the result of running this code.4 This can get unwieldy pretty quickly, so the idea is to structure things so that what you are inserting here is concise. Our list structure in ans makes that happen: for example, to get the R-squared value for the second regression, you ask for the thing called rsq within the list r2 (representing the second regression) within the big list ans, using dollar signs to dig down through the data structure we made.

I got this idea from Tristan Mahr, who has some extra cleverness to automate the building of the list that I called ans, rather than using copy-paste as I did.
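Another option, if you would rather build the whole table in code than type the Markdown yourself, is a sketch like this using knitr::kable (not the method described above, but it uses the same ans list):

# build a small dataframe of the results and let kable format it as a table
library(knitr)
tibble(
  Regression = c("Regression 1", "Regression 2"),
  Slope = c(ans$r1$slope, ans$r2$slope),
  `R-squared` = c(ans$r1$rsq, ans$r2$rsq)
) %>% kable()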

\(\blacksquare\)

Footnotes

  1. See this dictionary: the original meaning was “smelly”, but then it came to be applied to music that resembled African-American blues or gospel, particularly in rhythm, and then to anything unconventional, which is my meaning here.↩︎

  2. You only need enough decimals to make the comparison clear. Two is enough, but I went with three. The third digit of the first slope is zero, which ought to be shown really, but isn’t.↩︎

  3. Not to be confused with %>%, which is the tidyverse pipe.↩︎

  4. You can do this anywhere in a Quarto document, not just in a table.↩︎