Worksheet 10

Published

November 12, 2025

Questions are below. My solutions will be available after the tutorials are all finished. The whole point of these worksheets is for you to use your lecture notes to figure out what to do. In tutorial, the TAs are available to guide you if you get stuck. Once you have figured out how to do this worksheet, you will be prepared to tackle the assignment that depends on it.

If you are not able to finish in an hour, I encourage you to continue later with what you were unable to finish in tutorial.

Packages

library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   4.0.0     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(MASS, exclude = "select")
library(broom)

simple regression

The boiling point of water

The boiling point of water is commonly known to be 100 degrees C (212 degrees F). But the actual boiling point of water depends on atmospheric pressure; when the pressure is lower, the boiling point is also lower. For example, at higher altitudes, the atmospheric pressure is lower because the air is thinner, so that in Denver, Colorado, which is 1600 metres above sea level, water boils at around 95 degrees C. Source.

Some data were collected on the atmospheric pressure at seventeen locations (pressure in the data file, in inches of mercury) and the boiling temperature of water at those locations (boiling, in degrees F). This is (evidently) American data. The data are in http://ritsokiguess.site/datafiles/boiling-point.csv. Our aim is to predict boiling point from atmospheric pressure.

  1. Read in and display (some of) the data.
  1. Draw a suitable plot of these data.
  1. Comment briefly on your plot and any observations that appear not to belong to the pattern shown by the rest of the observations.
  1. Fit a suitable linear regression, and display the results. (Do this even if you think this is not appropriate.)
  1. Comment briefly on whether the slope makes sense, and on the overall fit of the model.
  1. Make two suitable plots that can be used to assess the appropriateness of this regression.
  1. When you looked at your scatterplot, you may have identified some observations that did not follow the pattern of the others. Describe briefly how these observations show up on the two plots you just drew.
  1. It turns out that the two observations with the lowest pressure are errors. Create a new dataframe with these observations removed, and repeat the regression. (You do not need to make the residual plots.)
  1. Compare the slope and the R-squared from this regression with the values you got in the first regression. Why is it sensible that the values differ in the way they do?
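As a reminder of the mechanics (not a solution), here is a minimal sketch of the fit, check, remove, refit workflow. It runs on simulated data with made-up column names `x` and `y`, in which the two observations with the smallest `x` have deliberately been pushed off the line. Everything about `sim` is an assumption for illustration; your own code should use the boiling point data and its column names.

```r
library(tidyverse)
library(broom)

# Simulated stand-in data (NOT the worksheet data): a straight-line trend
# plus noise, with the two lowest-x points shifted upwards by 5.
set.seed(1)
sim <- tibble(x = seq(20, 30, length.out = 17)) %>%
  mutate(y = 150 + 2 * x + rnorm(17, sd = 0.3),
         y = ifelse(rank(x) <= 2, y + 5, y))

# Scatterplot with the least-squares line added
ggplot(sim, aes(x = x, y = y)) + geom_point() + geom_smooth(method = "lm")

# Fit the regression; tidy() gives the coefficient table, glance() the R-squared
fit_all <- lm(y ~ x, data = sim)
tidy(fit_all)
glance(fit_all)

# Residuals vs fitted values, and a normal quantile plot of the residuals
aug <- augment(fit_all)
ggplot(aug, aes(x = .fitted, y = .resid)) + geom_point()
ggplot(aug, aes(sample = .resid)) + stat_qq() + stat_qq_line()

# Remove the two observations with the smallest x, then refit
sim_clean <- sim %>% arrange(x) %>% slice(-(1:2))
fit_clean <- lm(y ~ x, data = sim_clean)
tidy(fit_clean)
glance(fit_clean)
```

Sorting and then dropping the first two rows is only one way to remove observations; a `filter()` on a condition that picks out the rows you want to keep would do just as well.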

Thermal spray coatings

A coating is sprayed onto stainless steel, and the strength of the bond between the coating and the stainless steel is measured (in megapascals). Five different thicknesses of coating are used (measured in micrometres), and an engineer is interested in the relationship between the thickness of the coating and the strength of the bond. Some data are in http://ritsokiguess.site/datafiles/coatings.csv.

  1. Read in and display (some of) the data.
  1. Draw a suitable graph that illustrates how the coating thickness influences the bond strength.
  1. Comment briefly on the kind of relationship you see here, if any.
  1. Fit a straight-line regression and display the results. (You will have an opportunity to criticize it shortly.)
  1. By making a suitable plot, demonstrate that the relationship is actually curved rather than linear.
  1. Add a squared term in thickness to your regression, and display the output.
  1. The Estimate for thickness-squared is very small in size. Why, nonetheless, was it definitely useful to add that squared term?
  1. Is the plot of residuals vs fitted values better from your second regression than it was from the first one? Draw it, and explain briefly.
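The mechanics of adding a squared term can be sketched as below, again on simulated data rather than the coatings data. The names `x`, `y`, `fit_line`, and `fit_quad` are made up for illustration, and the curve is built into the simulation on purpose. The key point is that the squared term has to be wrapped in `I()` inside the model formula, so that `^` is read as arithmetic.

```r
library(tidyverse)
library(broom)

# Simulated stand-in data (NOT the worksheet data): five x values, four
# observations at each, with a relationship that levels off as x increases.
set.seed(2)
sim <- tibble(x = rep(c(100, 200, 300, 400, 500), each = 4)) %>%
  mutate(y = 10 + 0.08 * x - 0.0001 * x^2 + rnorm(n(), sd = 1))

# Straight-line fit, with residuals vs fitted values to look for a curve
fit_line <- lm(y ~ x, data = sim)
ggplot(augment(fit_line), aes(x = .fitted, y = .resid)) + geom_point()

# Add the squared term; I() protects the arithmetic inside the formula
fit_quad <- lm(y ~ x + I(x^2), data = sim)
tidy(fit_quad)
glance(fit_quad)

# The same residual plot for the model with the squared term
ggplot(augment(fit_quad), aes(x = .fitted, y = .resid)) + geom_point()
```

When reading the `tidy()` output, bear in mind that the size of an Estimate depends on the units of the variable it multiplies (here the squared values of `x` are large numbers), so the P-value is the better guide to whether a term is doing anything.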

multiple regression

Houses in Duke Forest, North Carolina

The data in http://ritsokiguess.site/datafiles/duke_forest.csv are of houses that were sold around November 2020 in the Duke Forest area of Durham, North Carolina. For each house, the selling price (in US $), called price, was recorded, along with some other features of the house:

  • bed: the number of bedrooms
  • bath: the number of bathrooms
  • area: the area of the inside of the house, in square feet
  • year_built: the year the house was originally built

Our aim is to predict the selling price of a house from its other features. There are 97 houses in the data set.

Note: this is rather long, but I wanted to give you a chance to practice everything.

  1. Read in and display (some of) the data.
  1. Make a graph of selling price against each of the explanatory variables, using one ggplot line.
  1. Comment briefly on your plots.
  1. Fit a regression predicting price from the other variables, and display the results.
  1. What is the meaning of the number in the bath row in the Estimate column?
  1. Plot the residuals from your regression against the fitted values. What evidence is there that a transformation of the selling prices might be a good idea? (Hint: look at the right side of your graph.)
  1. Run Box-Cox. What transformation of price is suggested, if any?
  1. Rerun your regression with a suitably transformed response variable, and display the results.
  1. Confirm that the plot of residuals against fitted values now looks better.
  1. Build a better model by removing any explanatory variables that play no role, one at a time.
  1. If you want to, make a full set of residual plots for your final model (residuals vs fitted values, normal quantile plot of residuals, residuals vs all the explanatory variables) and convince yourself that all is now at least reasonably good. (I allow for the possibility that you are now bored with this and would like to move on to something else, but I had already done these, so…)
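For reference, here is a minimal sketch of the pieces this question strings together: plotting the response against several explanatory variables at once, Box-Cox, refitting with a transformed response, and removing a term. It uses simulated data with made-up names (`x1`, `x2`, `x3`, `y`), set up so that log of `y` is linear in `x1` and `x2` and `x3` plays no role. None of this is the Duke Forest data, and the transformation that suits the simulation need not be the one that suits the houses.

```r
library(tidyverse)
library(broom)
library(MASS, exclude = "select")

# Simulated stand-in data (NOT the worksheet data)
set.seed(3)
sim <- tibble(x1 = runif(100, 1, 10),
              x2 = runif(100, 1, 10),
              x3 = runif(100, 1, 10)) %>%
  mutate(y = exp(1 + 0.2 * x1 + 0.1 * x2 + rnorm(100, sd = 0.2)))

# y against each explanatory variable: pivot longer, then one facet per variable
sim %>%
  pivot_longer(-y, names_to = "xname", values_to = "xval") %>%
  ggplot(aes(x = xval, y = y)) + geom_point() +
  facet_wrap(~xname, scales = "free")

# First regression, and its residuals vs fitted values
fit1 <- lm(y ~ x1 + x2 + x3, data = sim)
ggplot(augment(fit1), aes(x = .fitted, y = .resid)) + geom_point()

# Box-Cox: the plot shows a confidence interval for lambda;
# plotit = FALSE returns the log-likelihood values instead of drawing them
boxcox(y ~ x1 + x2 + x3, data = sim)
bc <- boxcox(y ~ x1 + x2 + x3, data = sim, plotit = FALSE)
lambda_hat <- bc$x[which.max(bc$y)]
lambda_hat

# Refit with a log response (lambda near 0 in this simulation)
fit2 <- lm(log(y) ~ x1 + x2 + x3, data = sim)
tidy(fit2)

# Remove one explanatory variable, keeping everything else the same
fit3 <- update(fit2, . ~ . - x3)
tidy(fit3)
```

Using `update()` saves retyping the model formula each time a term comes out, which helps when removing variables one at a time.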