Worksheet 9

Published

October 30, 2024

Questions are below. My solutions will be available after the tutorials are all finished. The whole point of these worksheets is for you to use your lecture notes to figure out what to do. In tutorial, the TAs are available to guide you if you get stuck. Once you have figured out how to do this worksheet, you will be prepared to tackle the assignment that depends on it.

If you are not able to finish in an hour, I encourage you to continue later with what you were unable to finish in tutorial.

Packages

library(tidyverse)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

American Community Survey

The American Community Survey is a huge sample survey that addresses many aspects of American communities. The data in http://ritsokiguess.site/datafiles/acs4.txt, in aligned columns, contain estimates of the total housed population (people who own or rent a place to live), the total number of renters, and the median rent, in two US states. The column called error contains standard errors of the estimates (obtained using methods like the ones in STAC53). The states are identified by name and number, the latter in the column geoid.

Read in and display the data.
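As a rough sketch of reading aligned-column data, the code below writes a tiny invented file and reads it back with read_table, which splits fields on runs of whitespace. The state names, variable names, and numbers are made up for illustration, and the column name `name` is an assumption; only geoid, variable, estimate, and error are named in the description above.

```r
library(tidyverse)

# Hypothetical miniature file in aligned columns (all values invented)
toy_lines <- c(
  "geoid name   variable  estimate error",
  "11    StateA total_pop     1000    10",
  "11    StateA renters        300     5"
)
toy_file <- tempfile(fileext = ".txt")
write_lines(toy_lines, toy_file)

# read_table treats one or more spaces as a single column separator
acs_toy <- read_table(toy_file)
acs_toy
```

The same call should accept the URL of the real file in place of toy_file, since readr functions can read directly from a web address.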

Create columns containing the values in estimate for each of the three items in variable. (That is to say, you should get three new columns; the names of those new columns are the items in variable.) This first attempt will probably give you six rows and some missing values (we discuss why in the next part).

Explain briefly why your output in the previous part came out as it did.

Using techniques learned in this course and your insight from the previous part, arrange the data to have three columns of estimate values whose names are the three items in variable, and only two rows, one for each state.
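One possible way to think about the last three parts is sketched below on an invented miniature of the data. The state names, variable names, and all numbers are hypothetical, and the column name `name` is an assumption; the real file may differ. The sketch contrasts a first attempt at pivot_wider with a version that says explicitly which columns identify a row.

```r
library(tidyverse)

# Invented stand-in for the survey data (not the real values)
acs_toy <- tribble(
  ~geoid, ~name,    ~variable,     ~estimate, ~error,
  11,     "StateA", "total_pop",   1000,      10,
  11,     "StateA", "renters",     300,       5,
  11,     "StateA", "median_rent", 900,       8,
  12,     "StateB", "total_pop",   2000,      12,
  12,     "StateB", "renters",     700,       6,
  12,     "StateB", "median_rent", 1100,      9
)

# First attempt: every column not named here (including error) is used to
# identify rows, and each error value is different, so nothing combines
naive <- acs_toy %>%
  pivot_wider(names_from = variable, values_from = estimate)
naive

# Saying that only geoid and name identify a row drops error from the result
wide <- acs_toy %>%
  pivot_wider(id_cols = c(geoid, name),
              names_from = variable, values_from = estimate)
wide
```

Removing error with select before pivoting would be another way to reach the same two-row result.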

The boiling point of water

The boiling point of water is commonly known to be 100 degrees C (212 degrees F). But the actual boiling point of water depends on atmospheric pressure; when the pressure is lower, the boiling point is also lower. For example, at higher altitudes, the atmospheric pressure is lower because the air is thinner, so that in Denver, Colorado, which is 1600 metres above sea level, water boils at around 95 degrees C. Source.

Some data were collected on the atmospheric pressure at seventeen locations (pressure in the data file, in inches of mercury) and the boiling temperature of water at those locations (boiling, in degrees F). This is (evidently) American data. The data are in http://ritsokiguess.site/datafiles/boiling-point.csv. Our aim is to predict boiling point from atmospheric pressure.

Read in and display (some of) the data.

Draw a suitable plot of these data.
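With two quantitative variables, a scatterplot with the response on the vertical axis is one natural choice. The sketch below uses six invented data points rather than the real file, so the picture it draws is not the one you should expect from the actual data.

```r
library(tidyverse)

# Invented values for illustration only; the real data would be read with
# read_csv("http://ritsokiguess.site/datafiles/boiling-point.csv")
boiling_toy <- tibble(
  pressure = c(20, 22, 24, 26, 28, 30),
  boiling  = c(193, 197, 201, 204, 208, 212)
)

# Explanatory variable (pressure) on x, response (boiling) on y
p <- ggplot(boiling_toy, aes(x = pressure, y = boiling)) +
  geom_point()
p
```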

Comment briefly on your plot and any observations that appear not to belong to the pattern shown by the rest of the observations.

Fit a suitable linear regression, and display the results. (Do this even if you think this is not appropriate.)
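A minimal sketch of fitting and displaying a regression with lm, again on invented values rather than the real data. The model name boiling_1 is just an illustrative choice.

```r
library(tidyverse)

# Invented values for illustration only (not the real measurements)
boiling_toy <- tibble(
  pressure = c(20, 22, 24, 26, 28, 30),
  boiling  = c(193, 197, 201, 204, 208, 212)
)

# Response on the left of the squiggle, explanatory variable on the right
boiling_1 <- lm(boiling ~ pressure, data = boiling_toy)
summary(boiling_1)
```

The summary output shows the intercept and slope estimates along with R-squared, which are the quantities the next part asks about.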

Comment briefly on whether the slope makes sense, and on the overall fit of the model.

Make two suitable plots that can be used to assess the appropriateness of this regression.
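Two commonly used plots are residuals against fitted values and a normal quantile plot of the residuals. The sketch below assumes those are the two plots intended, and builds them from an invented dataset; the names fit_toy and diag_toy are illustrative only.

```r
library(tidyverse)

# Invented values for illustration only (not the real measurements)
boiling_toy <- tibble(
  pressure = c(20, 22, 24, 26, 28, 30),
  boiling  = c(193, 197, 201, 204, 208, 212)
)
fit_toy <- lm(boiling ~ pressure, data = boiling_toy)

# Collect fitted values and residuals in a dataframe for plotting
diag_toy <- tibble(
  fitted   = fitted(fit_toy),
  residual = resid(fit_toy)
)

# Residuals against fitted values: look for a random scatter around zero
p_resid <- ggplot(diag_toy, aes(x = fitted, y = residual)) +
  geom_point() +
  geom_hline(yintercept = 0)

# Normal quantile plot of residuals: look for points close to the line
p_qq <- ggplot(diag_toy, aes(sample = residual)) +
  stat_qq() +
  stat_qq_line()

p_resid
p_qq
```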

When you looked at your scatterplot, you may have identified some observations that did not follow the pattern of the others. Describe briefly how these observations show up on the two plots you just drew.

It turns out that the two observations with the lowest pressure are errors. Create a new dataframe with these observations removed, and repeat the regression. (You do not need to make the residual plots.)
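One way to drop the two lowest-pressure rows is to sort by pressure and remove the first two. The sketch below does this on invented data in which two low-pressure points were deliberately placed off the trend, so the direction and size of the change it shows come from the made-up values and need not match the real data.

```r
library(tidyverse)

# Invented values; the first two rows are deliberately off the trend
boiling_toy <- tibble(
  pressure = c(15, 16, 20, 22, 24, 26, 28, 30),
  boiling  = c(200, 201, 193, 197, 201, 204, 208, 212)
)

# Sort by pressure, then drop the two rows with the lowest values
boiling_clean <- boiling_toy %>%
  arrange(pressure) %>%
  slice(-(1:2))

fit_all   <- lm(boiling ~ pressure, data = boiling_toy)
fit_clean <- lm(boiling ~ pressure, data = boiling_clean)

# Slope and R-squared before and after removing the two rows
coef(fit_all)[["pressure"]]
coef(fit_clean)[["pressure"]]
summary(fit_all)$r.squared
summary(fit_clean)$r.squared
```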

Compare the slope and the R-squared from this regression with the values you got in the first regression. Why is it sensible that the values differ in the way they do?