Questions are below. My solutions will be available after the tutorials are all finished. The whole point of these worksheets is for you to use your lecture notes to figure out what to do. In tutorial, the TAs are available to guide you if you get stuck. Once you have figured out how to do this worksheet, you will be prepared to tackle the assignment that depends on it.
If you are not able to finish in an hour, I encourage you to continue later with what you were unable to finish in tutorial.
Packages
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ readr 2.1.5
✔ forcats 1.0.1 ✔ stringr 1.6.0
✔ ggplot2 4.0.1 ✔ tibble 3.3.0
✔ lubridate 1.9.4 ✔ tidyr 1.3.1
✔ purrr 1.2.0
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(MASS, exclude = "select")
library(broom)
The boiling point of water
The boiling point of water is commonly known to be 100 degrees C (212 degrees F). But the actual boiling point of water depends on atmospheric pressure; when the pressure is lower, the boiling point is also lower. For example, at higher altitudes, the atmospheric pressure is lower because the air is thinner, so that in Denver, Colorado, which is 1600 metres above sea level, water boils at around 95 degrees C. Source.
Some data were collected on the atmospheric pressure at seventeen locations (pressure in the data file, in inches of mercury) and the boiling temperature of water at those locations (boiling, in degrees F). This is (evidently) American data. The data are in http://ritsokiguess.site/datafiles/boiling-point.csv. Our aim is to predict boiling point from atmospheric pressure.
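Reading the data in first (a sketch of the usual read_csv call; boiling_point is the dataframe name used below):
boiling_point <- read_csv("http://ritsokiguess.site/datafiles/boiling-point.csv")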
Rows: 17 Columns: 2
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
dbl (2): boiling, pressure
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
boiling_point
There are correctly 17 rows of data with two columns boiling and pressure as promised.
\(\blacksquare\)
Draw a suitable plot of these data.
Solution
There are two quantitative variables, so a scatterplot is the thing. We are trying to predict boiling point, so that goes on the \(y\)-axis:
ggplot(boiling_point, aes(x = pressure, y = boiling)) + geom_point() + geom_smooth()
`geom_smooth()` using method = 'loess' and formula = 'y ~ x'
Adding a smooth trend will help us to get a sense of what kind of relationship there is. This is better than adding a regression line to this plot, because we don’t know whether the relationship is straight.
Comment on this one below.
\(\blacksquare\)
Comment briefly on your plot and any observations that appear not to belong to the pattern shown by the rest of the observations.
Solution
There is a clear upward trend, one that appears to be linear, apart from the two observations on the left that have a higher boiling point than you would expect for such a low pressure.
Do not, at this point, suggest removing these points. If they are genuine, they have to be included in our model. Removing them for no reason (other than “they don’t fit a line”) is scientific dishonesty, because you want a model that will generalize to future observations, including ones like those on the left. If you want to make a suggestion, suggest checking whether those points on the left are errors, and if they are errors, removing them. That is the only justification for removing data points: in that case, they are not like the others, for a reason you can explain.
\(\blacksquare\)
Fit a suitable linear regression, and display the results. (Do this even if you think this is not appropriate.)
Solution
You probably think that those points over on the left mean that we should not fit a regression at all (which is the reason for my parenthetical comment in the question), but fit the regression here anyway. We will critique it shortly.
boiling_point.1 <- lm(boiling ~ pressure, data = boiling_point)
summary(boiling_point.1)
Call:
lm(formula = boiling ~ pressure, data = boiling_point)
Residuals:
Min 1Q Median 3Q Max
-1.41483 -0.91550 -0.05148 0.76941 2.72840
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 163.9307 2.6551 61.74 < 2e-16 ***
pressure 1.5796 0.1051 15.04 1.88e-10 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1.291 on 15 degrees of freedom
Multiple R-squared: 0.9378, Adjusted R-squared: 0.9336
F-statistic: 226 on 1 and 15 DF, p-value: 1.879e-10
\(\blacksquare\)
Comment briefly on whether the slope makes sense, and on the overall fit of the model.
Solution
There are actually three things I would like to see:
the slope is positive, as predicted in the question (a lower pressure goes with a lower boiling point, and thus higher with higher also).
the slope is (very strongly) significant, meaning that the positive slope here is far from just chance.
the R-squared is 94%, meaning that the regression fits well even despite the two points on the left of the scatterplot.
\(\blacksquare\)
Make two suitable plots that can be used to assess the appropriateness of this regression.
Solution
The two plots in question are of the residuals vs. fitted values, and a normal quantile plot of the residuals. Don’t forget to use your fitted model object (my boiling_point.1) as the “dataframe” to feed into ggplot (see below for why the warning occurs and how you can fix it):
ggplot(boiling_point.1, aes(x = .fitted, y = .resid)) + geom_point()
Warning: `fortify(<lm>)` was deprecated in ggplot2 4.0.0.
ℹ Please use `broom::augment(<lm>)` instead.
ℹ The deprecated feature was likely used in the ggplot2 package.
Please report the issue at <https://github.com/tidyverse/ggplot2/issues>.
You might have been wondering how ggplot knows where to get the fitted values and residuals from. The answer, at least for now in ggplot2, is that it runs a function called fortify to get them from the fitted model:
fortify(boiling_point.1)
This does indeed contain .fitted, .resid and other things from the regression, along with the original data (in the first two columns). The problem is that this is duplicating work that is also done in the broom package, specifically its function augment. This is a problem because if augment changes, fortify would have to change to keep in step, and this is a lot of work. So the people behind ggplot2 decided they would recommend using augment from broom instead. This is a good decision because broom is under active development to encompass different kinds of models, and so augment will always be current, and ggplot2 does better to hand off that work to broom and not try to keep fortify current as well. “Don’t repeat yourself”, as software people say.
So how to use augment? First, install and load broom. On r.datatools, it’s already installed and you only need the library(broom) line. (My library(broom) is at the top of this worksheet.) Then, make an “augmented” dataframe as follows:
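# for example (the name boiling_point.1a follows the naming convention described below):
boiling_point.1a <- augment(boiling_point.1)
boiling_point.1a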
This (like the output from fortify above) contains the original data plus things like fitted values (in .fitted) and residuals (in .resid). My “naming convention” is to put a letter a (for “augmented”) on the end of the model. Then, use this dataframe to make plots with, such as
ggplot(boiling_point.1a, aes(x = .fitted, y = .resid)) + geom_point()
and now you get no warnings.
As I type this, it’s Monday night with a lecture on Tuesday on this stuff. I will eventually have to reorganize my lecture notes, which will probably mean bringing in broom from earlier on in regression and making the augmented dataframe as the first step after running the regression, but for now, it’s all right to get warnings when you make your graphs.
The lm function is one of R’s oldest; you can imagine that one of the first things early users of R would have wanted to do is to run regressions. The summary output shows its age rather; it is not very aesthetic, and is designed for the human eye to read (and not for programmers to pull bits out of). A bonus of loading broom is that you get the glance and tidy functions, which summarize the regression output in dataframes:
glance(boiling_point.1)
This is a one-line summary of the entire model, including R-squared and the global \(F\)-test (of whether anything in the model is helpful, to which the answer will almost always be yes). Also this:
tidy(boiling_point.1)
which shows intercept and slope(s), and tests for their significance. For example, you might want to pick out the slope estimate:
tidy(boiling_point.1) %>% pluck("estimate", 2)
[1] 1.579647
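We also need the second of our two plots, a normal quantile plot of the residuals. Here is a sketch of one way to draw it, using the augmented dataframe from above:
ggplot(boiling_point.1a, aes(sample = .resid)) + stat_qq() + stat_qq_line()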
Extra: On the residuals vs. fitted values, the two strange observations show up a long way from the others at the top left. The other residuals also appear to have an upward trend. This is characteristic of the off-trend points being influential: they pull the line closer to themselves than it should really go. Here is the scatterplot from before, but this time with the regression line on it:
ggplot(boiling_point, aes(x = pressure, y = boiling)) + geom_point() + geom_smooth(method = "lm")
`geom_smooth()` using formula = 'y ~ x'
When the off-trend points are unusually high or low (these ones are the lowest), they can have a disproportionate influence over where the line goes. Here, the line is higher on the left and lower on the right than it would be if those two low-pressure points were not there. (A better line then would be steeper than this one, lower on the left and higher on the right.) This has the effect of producing a lot of negative residuals on the left and a lot of positive ones on the right, which explains the trend in the other points on your plot.
The normal quantile plot is actually not bad. I expected those two very positive residuals to look like outliers at the top, and they really don’t; the normal quantile plot suggests a right-skewed distribution of residuals (an upward-opening curve) more than upper-tail outliers.
In this case, it is the residual vs. fitted value plot that really reveals the problem. The fact that there is a problem on one of these plots means that the regression as a whole is not appropriate.
\(\blacksquare\)
When you looked at your scatterplot, you may have identified some observations that did not follow the pattern of the others. Describe briefly how these observations show up on the two plots you just drew.
Solution
On the scatterplot, we saw two points that were off a linear trend on the left. These had low pressure and a higher boiling point than expected given that low pressure. These will show up as very positive residuals:
on the residuals vs fitted values, they are up in the top left corner
on the normal quantile plot of residuals, they are the two most positive residuals, at the top right.
Try to be as brief as this (you may need to think about what are the most relevant things to say).
\(\blacksquare\)
It turns out that the two observations with the lowest pressure are errors. Create a new dataframe with these observations removed, and repeat the regression. (You do not need to make the residual plots.)
Solution
Find a way to select the two observations with the lowest pressure. The easiest way is to note that their pressure values, whatever they are, must be less than about 21.25 (the major gridlines on my scatterplot are 2.5 apart, so the minor ones are 1.25 apart, and so the vertical line in the graph background is 1.25 less than 22.5, i.e., at 21.25). That would give this:
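# a sketch: keep only the rows with pressure above 21.25
# (boiling_point_no_lo is the name used for this dataframe later)
boiling_point_no_lo <- boiling_point %>% filter(pressure > 21.25)
boiling_point_no_lo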
slice with negative input says “everything but those rows”, or you could feed slice rows 3 through 17 of the original dataframe.
The data values are already sorted by pressure, in fact, so if you don’t do the sorting yourself you need to look and see that the two values you want to remove actually are the first two.
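For example, either of these would do it (a sketch of the two slice approaches just described):
boiling_point %>% slice(-(1:2))
boiling_point %>% slice(3:17)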
A third way is to use slice_max. It combines the slice with the sorting, so that you can do it all in one step:
boiling_point %>% slice_max(pressure, n = 15)
The n = 15 is to say that you want to keep the \(17-2 = 15\) rows that have the largest pressure values, which is the same thing as throwing away the two rows with the smallest ones.
I had a hard time naming this dataset. I didn’t want to use numbers because I am using those for models.
All right, the regression:
boiling_point.2 <- lm(boiling ~ pressure, data = boiling_point_no_lo)
summary(boiling_point.2)
Call:
lm(formula = boiling ~ pressure, data = boiling_point_no_lo)
Residuals:
Min 1Q Median 3Q Max
-1.16358 -0.22154 0.01851 0.21714 0.76407
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 157.54528 1.24593 126.45 < 2e-16 ***
pressure 1.81477 0.04827 37.59 1.19e-14 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.5041 on 13 degrees of freedom
Multiple R-squared: 0.9909, Adjusted R-squared: 0.9902
F-statistic: 1413 on 1 and 13 DF, p-value: 1.193e-14
\(\blacksquare\)
Compare the slope and the R-squared from this regression with the values you got in the first regression. Why is it sensible that the values differ in the way they do?
Solution
I’m going to make a little Markdown table out of the values, for ease of comparison:1
| Regression   | Slope | R-squared |
|--------------|-------|-----------|
| Regression 1 | 1.58  | 0.938     |
| Regression 2 | 1.815 | 0.991     |
The second regression has both a more positive slope and a bigger R-squared than the first one.
The slope is bigger (more positive) after we remove the two observations because they had a higher than expected boiling point, and that pulled the line up on the left compared to the regression without those two observations. Thus the first line is less steep than the second, meaning that it has a less positive slope.
The R-squared is bigger because all the points, apart from the two we removed, are very close to a straight-line trend, and so the second regression should be, and is, very close to a perfect fit.
Extra: when you make a table like this, you should really pull the values out of the R output and not type them again. (What if you change the data, and then your numbers will be wrong? What if you realize you made a mistake earlier?)
Here are the ingredients I used, followed by my “recipe”:
$ for pulling things out of other things
tidy from the broom package for extracting intercept and slope
glance from broom for extracting R-squared (I probably didn’t need to use broom but it’s easier this way)
round to round a result to a fixed number of decimals
list to make a structure to store my results in and get them out of
creating a variable to hold my number of decimal places, in case I change my mind after seeing the results
create a list called ans to hold all the results. Lists are a very flexible data structure in R; you can basically put anything in an R list, even another list, which we are going to do shortly.
inside ans, create a list r1 that will hold the results we want from the first regression.
r1 contains a thing called slope that will contain the slope from the first regression:
feed the first model into tidy
extract the thing called estimate, which contains the intercept and slope in that order
extract the second one of those, which is the slope
round it to your pre-specified number of decimal places
r1 also contains a thing called rsq which contains the R-squared for the first regression:
feed the model into glance
extract the thing called r.squared
round it to the right number of decimals
repeat the whole thing for the second regression by creating a list r2 with the same things in it (copy, paste, edit)
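Here is a sketch of what that recipe might look like in code (the names decimals and ans come from the description above; the details of your version may differ):
decimals <- 3
ans <- list(
  r1 = list(
    slope = round(tidy(boiling_point.1)$estimate[2], decimals),
    rsq = round(glance(boiling_point.1)$r.squared, decimals)
  ),
  r2 = list(
    slope = round(tidy(boiling_point.2)$estimate[2], decimals),
    rsq = round(glance(boiling_point.2)$r.squared, decimals)
  )
)
ans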
This seems very complicated, but (i) after you’ve done it once, you get the hang of it, and (ii) the payoff comes when we make the table because we calculated exactly what we need. The rounding is always the most annoying part.
There are several ways to make tables in Quarto. I like the one below:
Columns are separated by |, which is the Linux “pipe” symbol.2 This means that columns don’t have to line up.3 The long row of - draws a horizontal rule between the first row (formatted as headings) and the rest of the table. To insert the values we went to such great trouble to obtain, we use “inline R code”. To do this, a backtick followed by {r} followed by some code followed by another backtick displays the result of running this code.4 This can get unwieldy pretty quickly, so the idea is to structure things so that what you are inserting here is concise. Our list structure in ans makes that happen: for example, to get the R-squared value for the second regression, you ask for the thing called rsq within the list r2 (representing the second regression) within the big list ans, using dollar signs to dig down through the data structure we made.
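For example, the source of the table above might look something like this (a sketch, using the ans list built earlier; each backtick-{r} piece is inline R code):
| Regression   | Slope              | R-squared        |
|--------------|--------------------|------------------|
| Regression 1 | `{r} ans$r1$slope` | `{r} ans$r1$rsq` |
| Regression 2 | `{r} ans$r2$slope` | `{r} ans$r2$rsq` |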
I got this idea from Tristan Mahr. He has some extra cleverness to automate the building of the list (my ans), instead of cutting and pasting as I did before.
Extra 1:
If you are feeling brave enough, you can support your answers above by drawing a scatterplot with both lines on, to make it clearer as to why they differ in the way they do. There are several ways you might tackle this; here’s mine:
ggplot(boiling_point, aes(x = pressure, y = boiling)) +
  geom_point(colour = "red") +
  geom_smooth(method = "lm", linetype = "dashed", se = FALSE) +
  geom_point(data = boiling_point_no_lo) +
  geom_smooth(data = boiling_point_no_lo, method = "lm", se = FALSE)
`geom_smooth()` using formula = 'y ~ x'
`geom_smooth()` using formula = 'y ~ x'
There are some things to be aware of:
the ggplot has the original dataframe as its default, so I begin by plotting the points and regression line for that (the first regression).
I plot all the data points in red (you’ll see why most of them look black in a moment).
the first regression line is drawn dashed, so that we can use the default solid line for the second one. (You could also draw it a different colour.) I removed the default grey “envelope”, since the plot will be busy enough.
then I plot the points of the second data set (without the two low-pressure ones). This is a different data set than the one in the original ggplot, so I have to use data = inside it. The points will be the default black, and the x and y for the plot are still pressure and boiling, so I don’t need to say anything else. What this does is to draw black points over the top of the red ones we drew in the first geom_point, for all the points in the second data set. This leaves the low-pressure points red.
finally, I add the regression line for the second data set. I have to tell ggplot again to use the second data set, or else it will use the one in the ggplot statement. (I was caught by this the first time I made this graph, and only got one line.) Note that ggplot will not let you extrapolate; the lines only go as far as the most extreme \(x\)-values in the data set they are based on.
This shows how the first line is pulled up at the low end by the two red points, which are still quite far from the (dashed) line. Hence the second line will have a bigger (more positive) slope, and the remaining points are closer to it, so it will have a higher R-squared as well.
Extra 2: I didn’t ask for the residual plots for the second regression. I hope there will be no problems with them now.5 But in case you are interested:
ggplot(boiling_point.2, aes(x = .fitted, y = .resid)) + geom_point()
That is kind of odd, with the upward trend, the point at the bottom, then the downward trend. Let’s try to investigate more:
boiling_point.2 %>% augment(boiling_point_no_lo)
The odd one at the bottom of the plot is the 10th observation, observed 204.6 and predicted 205.8. I think what has happened is that, having removed those two very off-trend points on the left, we now have an off-trend point in the middle. This is a legit observation, though, so we are stuck with it. As the two off-trend points on the left did, it pulls the line (slightly) towards itself (down) and away from the line that we would get if we took this point out.6 In the same way that we got a funny trend in the other points in the residual plot from the first regression, what this one says is that if you were to take this point out, you’d have a different trend for the other points again. To go back and look at my plot (the one with both regression lines on it), the impression seems to be that if you fitted a regression line to just the points to the left of obs. 10, it would have a steeper slope, and if you fitted one to just the points to the right of obs. 10, it would be less steep. That would explain the pattern on the plot.7 I think I have that right. Feel free to check by working it out yourself.
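The other residual plot is a normal quantile plot of these residuals; a sketch of how you might draw it:
ggplot(boiling_point.2, aes(sample = .resid)) + stat_qq() + stat_qq_line()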
That seems to say long tails, but the very negative residual, that observation we were just talking about, really stands out at the bottom left. If you look back at the scatterplot, you can now see that it stands out there as well, in the middle of the plot and clearly below the line.
\(\blacksquare\)
Thermal spray coatings
A coating is sprayed onto stainless steel, and the strength of the bond between the coating and the stainless steel is measured (in megapascals). Five different thicknesses of coating are used (measured in micrometres), and an engineer is interested in the relationship between the thickness of the coating and its strength. Some data are in http://ritsokiguess.site/datafiles/coatings.csv.
Read in and display (some of) the data.
As ever (you could be forgiven for getting bored with this by now):
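# a sketch of the usual read_csv call:
coatings <- read_csv("http://ritsokiguess.site/datafiles/coatings.csv")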
Rows: 20 Columns: 2
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
dbl (2): thickness, strength
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
coatings
Draw a suitable graph that illustrates how the bond thickness influences the strength.
Two quantitative variables, so a scatterplot. The thickness is explanatory and the strength is the response. There are some clues: the word “influences” in the question, where \(x\) has an influence on \(y\) but not the other way around. Also, the thickness of the bond is something we can control, which is a good sign that it is explanatory, and the strength is whatever happens after we measure it, so it must be the outcome. You can’t always control things in a regression situation (remember the difference between an experiment and an observational study), but if you can control it, it cannot be the response.
ggplot(coatings, aes(x = thickness, y = strength)) + geom_point() + geom_smooth()
`geom_smooth()` using method = 'loess' and formula = 'y ~ x'
At this exploratory stage, I would add a smooth trend — not a line, because I don’t know whether the relationship is linear yet.
Comment briefly on the kind of relationship you see here, if any.
This looks to me like a curved relationship, one that seems to go up and then back down.
Those points over on the left are in fact correct, so even though the rest of the relationship looks satisfactorily linear, we cannot remove the points with the lowest thickness and try to fit a line through the others.
Fit a straight-line regression and display the results. (You will have an opportunity to criticize it shortly.)
coatings.1 <- lm(strength ~ thickness, data = coatings)
summary(coatings.1)
Call:
lm(formula = strength ~ thickness, data = coatings)
Residuals:
Min 1Q Median 3Q Max
-9.0435 -2.3624 0.9909 2.3624 5.6503
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 29.537238 2.276852 12.973 1.42e-10 ***
thickness -0.022699 0.004049 -5.606 2.55e-05 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 4.129 on 18 degrees of freedom
Multiple R-squared: 0.6358, Adjusted R-squared: 0.6156
F-statistic: 31.42 on 1 and 18 DF, p-value: 2.552e-05
This actually doesn’t look so bad. There is a significant downward trend with thickness (the trend is more down than up), and R-squared is decent if not great.
By making a suitable plot, demonstrate that the relationship is actually curved rather than linear.
The best one for the purpose is residuals against fitted values; if that is curved, then the relationship is really curved:
ggplot(coatings.1, aes(x = .fitted, y = .resid)) + geom_point()
On one of these, if you can describe a trend in the residuals in a few words, there is a problem with the regression. Here you certainly can: the residuals go up and then down again. This is a curve, so the original relationship is also curved not linear.
Extra: the other plot you might have considered here is a normal quantile plot of residuals:
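# a sketch of that normal quantile plot:
ggplot(coatings.1, aes(sample = .resid)) + stat_qq() + stat_qq_line()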
There is a little evidence of a left skew. This indicates that things could be improved, but it doesn’t tell us how they might be improved. The message from the plot of residuals vs fitted values is much clearer: “this is curved, so fit a curve”.
Add a squared term in thickness to your regression, and display the output.
Don’t forget the I() thing:
coatings.2 <- lm(strength ~ thickness + I(thickness^2), data = coatings)
summary(coatings.2)
Call:
lm(formula = strength ~ thickness + I(thickness^2), data = coatings)
Residuals:
Min 1Q Median 3Q Max
-5.7393 -2.3297 0.6785 2.2874 4.9330
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.444e+01 4.744e+00 3.045 0.00732 **
thickness 4.438e-02 1.977e-02 2.245 0.03836 *
I(thickness^2) -6.130e-05 1.783e-05 -3.439 0.00313 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 3.263 on 17 degrees of freedom
Multiple R-squared: 0.7852, Adjusted R-squared: 0.7599
F-statistic: 31.07 on 2 and 17 DF, p-value: 2.099e-06
The Estimate for thickness-squared is very small in size. Why, nonetheless, was it definitely useful to add that squared term?
Asking whether it is useful to have something in a regression is a good sign that the significance of the term is important. In this case, the P-value of the squared term is 0.0031, smaller than 0.05, so having it in the regression definitely adds to what was there before: it contributes over and above chance. The reason it is significant, despite being so small in size, is that its standard error is even smaller, so it is estimated very accurately:
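# a sketch: confidence intervals for the coefficients, via tidy from broom
tidy(coatings.2, conf.int = TRUE)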
The estimate may be close to zero, but at a 95% level of confidence, we are convinced it is negative: not very negative, but definitely negative.
The negative sign on the Estimate means that it is an opening-downward parabola: that is to say, one that has a maximum.
The other thing you might have looked at is R-squared. This is (substantially) higher than it was before (78.5% vs 63.6%), indicating that the regression now fits noticeably better. The problem with just looking at R-squared is that adding things to a regression will always increase R-squared (maybe for real, maybe by chance), and unless you have a good intuition, it’s hard to tell whether the increase you see is real or chance. That’s why we have hypothesis tests!
Extra: we investigate the marginaleffects package in D29, but one nice thing it gives us, that we can appreciate now, is a plot of predicted values:
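# one way to draw such a plot (a sketch; the marginaleffects package needs to be installed):
library(marginaleffects)
plot_predictions(coatings.2, condition = "thickness")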
The “envelope” around the predictions indicates the uncertainty in the predictions. There is some, because there was variability in strength even when the thickness was the same (also shown by the R-squared from coatings.2 being clearly less than perfect). Once the thickness gets beyond a certain point, the strength definitely decreases.
You might imagine that one reason for collecting data like this is that the engineer wants to find the best thickness, that is to say, the thickness for which the strength is greatest. On this graph, that appears to be about 375: definitely less than 400 and greater than 300. One way to find this optimal thickness would be to start by running a study like this one to narrow down the answer some, and then to follow up with a second study using thicknesses in a shorter interval (say, 300 to 400 in steps of 20) to focus in on the strengths for thicknesses closer to the optimal value.
Is the plot of residuals vs fitted values better from your second regression than it was from the first one? Draw it, and explain briefly.
Copy, paste, and edit your code:
ggplot(coatings.2, aes(x = .fitted, y = .resid)) + geom_point()
You are looking for a random scatter of points across the graph, bearing in mind that there are only five different fitted values because there were only five different thicknesses. I don’t think this is perfect: it seems to bend downwards a little in the middle, but it is (in my opinion) much closer to being a random scatter of points than it was before. So I would call this one “improved” rather than saying it is as good as I would like.
Extra: When you fit a curve and you still see something of a bend on this plot, the suspicion is that you fitted the wrong kind of curve. Let me see if I can add the data to the plot of predictions:
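# a sketch of that idea: start from the predictions plot and add the data points
plot_predictions(coatings.2, condition = "thickness") +
  geom_point(data = coatings, aes(x = thickness, y = strength))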
I don’t know what aes the original plot is using, but I know it’s a ggplot, so if I specify my dataset and x and y in the geom_point, I figure I can add the original data to the plot, and so it proves.
If I am going to be critical, I would say that the curve doesn’t go up high enough for the thickness values around 400, and it doesn’t go down low enough for the ones near 700. These show up on the residual plot respectively as:
a few too many positive residuals when the predicted strength is over 20
too many negative residuals when the predicted strength is around 15.
The second of these is mainly what created the impression of an up-down-up curve on our residual plot.
Houses in Duke Forest, North Carolina
The data in http://ritsokiguess.site/datafiles/duke_forest.csv are of houses that were sold around November 2020 in the Duke Forest area of Durham, North Carolina. For each house, the selling price (in US $), called price, was recorded, along with some other features of the house:
bed: the number of bedrooms
bath: the number of bathrooms
area: the area of the inside of the house, in square feet
year_built: the year the house was originally built
Our aim is to predict the selling price of a house from its other features. There are 97 houses in the data set.
Note: this is rather long, but I wanted to give you a chance to practice everything.
Read in and display (some of) the data.
Solution
Suggestion, before you start: we’ll be using MASS later, so install and load that first, before even tidyverse. Then load broom as well if you plan to use that later.
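Then the reading in is as usual (a sketch):
duke_forest <- read_csv("http://ritsokiguess.site/datafiles/duke_forest.csv")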
Rows: 97 Columns: 5
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
dbl (5): price, bed, bath, area, year_built
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
duke_forest
There are indeed 97 rows and the variables promised.
Extra: this is a simplified version of the duke_forest dataset from the openintro package. That has 98 houses, but one of them has a very low selling price given how big the house is (and thus had a very negative residual in the regressions that I ran). So I removed that house from the dataset I gave you, on the basis that there is something very different about that house. (The dataset in that package has more variables, which I also removed for you.)
Extra extra: if you are a fan of college basketball, you may recall that the Duke Blue Devils and Duke University come from Durham, North Carolina, which is undoubtedly where the Duke Forest name comes from. According to Google Maps, if you go south from the Duke University campus, past the big park, you reach the Duke Forest neighbourhood where these houses are.
Make a graph of selling price against each of the explanatory variables, using one ggplot line.
Solution
For a plot, we want the selling price in one column (it already is) and all the explanatory variable values in another column, with a third column labelling which explanatory variable it is. Then, to make the plot, plot selling price against the \(x\)-column, making facets according to the names of the \(x\)-variables.
This is done by pivoting longer first. I also always forget (until I see the graph) that a scales = "free" is needed, because different explanatory variables will be on different scales:
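# a sketch (the names xname and xvalue for the new columns are my choices):
duke_forest %>%
  pivot_longer(-price, names_to = "xname", values_to = "xvalue") %>%
  ggplot(aes(x = xvalue, y = price)) + geom_point() +
  facet_wrap(~ xname, scales = "free")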
If that code is too much for you, run the pivot-longer first and look at the results. Then think about how you would use that dataframe to make scatterplots of selling price against each of those explanatory variables.
The reason for the scales = "free" is that each of the explanatory variables is on its own scale, so you want the graph to have a different scale for each of them. For example, the year is a number like 2000, but the number of bedrooms is a number like 4. You can see, by taking out the scales = "free", that it doesn’t work very well to have all of them on the same scale!
Comment briefly on your plots.
Solution
All of them except for year_built (bottom right) show an upward apparently linear trend, with at least one outlier. I don’t think there is any trend with the age of the house.
This makes sense, because you would expect a house with more bedrooms or bathrooms, and a larger square footage, to be more expensive to buy.
That’s about as much as you need to say. There are not any problems with non-linearity here.
Extra: One of the houses was sold for near $1.4 million, and is at the top of each of these plots. This house has three bedrooms and four bathrooms, which makes you wonder why it sold for so much, until you look at the area plot, where it is at the top right: a huge house that happens to have only three bedrooms and four bathrooms. The hugeness is what made it so expensive.
Fit a regression predicting price from the other variables, and display the results.
Solution
price.1 <- lm(price ~ area + bath + bed + year_built, data = duke_forest)
summary(price.1)
Call:
lm(formula = price ~ area + bath + bed + year_built, data = duke_forest)
Residuals:
Min 1Q Median 3Q Max
-446286 -76678 -11789 77958 442108
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -308181.85 1623259.49 -0.190 0.84984
area 148.05 21.91 6.756 1.27e-09 ***
bath 70752.06 23685.14 2.987 0.00361 **
bed -33713.81 25488.64 -1.323 0.18921
year_built 189.11 832.21 0.227 0.82074
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 141600 on 92 degrees of freedom
Multiple R-squared: 0.6082, Adjusted R-squared: 0.5912
F-statistic: 35.71 on 4 and 92 DF, p-value: < 2.2e-16
Or, if you prefer,
glance(price.1)
tidy(price.1)
or just the second one of those.
What is the meaning of the number in the bath row in the Estimate column?
Solution
This value 70752 is the slope of bath in this multiple regression. It means that a house with an extra bathroom but the same number of bedrooms, same amount of square footage and built in the same year, is predicted to sell for just over $70,000 more. (The “all else equal” is important.)
This is, you might say, the “value” of an extra bathroom, in that if somebody wanted to remodel a house to add an extra bathroom, which would of course cost something, they could balance the construction cost with this slope to decide whether adding an extra bathroom is worth doing before they sell the house.
Extra: The coefficient for bed is negative, which appears to say that having another bedroom decreases the selling price, but look at the non-significant P-value: the slope coefficient for bed is actually not significantly different from zero. That is, once you know the other things, having an extra bedroom does not significantly increase the selling price.8 The reason for this is probably that a house with more bathrooms will also tend to have more bedrooms: that is, that bathrooms and bedrooms will be positively correlated. In regression, having correlated explanatory variables can cause some of them to be less significant than you would expect (“multicollinearity”). The idea is that a house with more bathrooms likely has more bedrooms, and you can infer this without actually needing to know how many bedrooms it has.
Extra: As we saw on the scatterplots, more bathrooms (and more area) go with higher selling prices, at least on average. These two explanatory variables are also significant. But bed has a negative slope, which seems to mean that a house with more bedrooms should sell for less. This makes no sense, and is not what the scatterplot we drew earlier says. As part of an explanation for that, note that bed is not significant, meaning that knowing the number of bedrooms doesn’t help in predicting the selling price of a house; thus that negative slope could just as easily be zero.
This is about half an explanation, since we would have expected the number of bedrooms to affect the selling price. Strictly speaking, what we mean is that if you know the number of bathrooms and the square footage, then knowing the number of bedrooms doesn’t improve your prediction of the selling price. (If all you knew was the number of bedrooms, then that knowledge would indeed tell you about the selling price, but we are saying something different here: if you know the number of bedrooms and some other stuff, including the number of bathrooms, knowing the number of bedrooms doesn’t tell you anything more than the other stuff does. Everything in multiple regression is relative to the model you are looking at.)
When something is unexpectedly non-significant in a multiple regression, the reason is often that it is correlated with one of the other explanatory variables (the dreaded “multicollinearity”). Here, you would expect that a house with a lot of bathrooms would also tend to have a lot of bedrooms:
ggplot(duke_forest, aes(x = bath, y = bed)) + geom_jitter()
There is at least some correlation there, which suggests that once you know how many bathrooms a house has, you can make a pretty good guess at how many bedrooms it has, even if you don’t actually know that. (Or, once you know the number of bathrooms, there is not much extra information in the number of bedrooms. You might imagine also that big houses, with a large square footage, will tend to have lots of bedrooms and bathrooms, so that area comes into the picture here as well.)
I recognized that the numbers of bathrooms and bedrooms are both whole numbers, so if I plotted them using geom_point, there would be a lot of points overwriting each other and I may not see them all. geom_jitter moves the points slightly (at random) so that you can see them all. The movement is “less than halfway to the next value” in both directions, so if you round each plotted point to the nearest whole number, that’s where it actually is. For example, the four points close to 4 bathrooms and 5 bedrooms actually are 4 bathrooms and 5 bedrooms, because even the rightmost point, plotted at something like 4.2 bathrooms, is still closest to 4 bathrooms.
A more detailed way to assess multicollinearity is to calculate the “variance inflation factor” for each explanatory variable. If this is high (the guideline I learned is “higher than 10”), then the explanatory variable in question is highly correlated with others and is giving a deceiving picture for that reason. Here is how you might find the VIFs here, using the package car which you might have to install first, on a model that still has bed in it:
library(car)
Loading required package: carData
Attaching package: 'car'
The following object is masked from 'package:dplyr':
recode
The following object is masked from 'package:purrr':
some
vif(price.1)
area bath bed year_built
1.957408 2.311307 1.766550 1.132533
These are actually not bad at all. The reason that bed is not significant seems literally to be that once you know the number of bathrooms and the square footage, knowing the number of bedrooms does not help you predict the selling price. I was surprised to discover this, but that seems to be the way it is.
Plot the residuals from your regression against the fitted values. What evidence is there that a transformation of the selling prices might be a good idea? (Hint: look at the right side of your graph.)
Solution
Use the fitted model object as if it were a dataframe:
ggplot(price.1, aes(x = .fitted, y = .resid)) + geom_point()
This looks pretty close to random, but if you look at the right side of the graph, you’ll see that the most extreme residuals (the two most positive ones and at least the four most negative ones) are all on the right side and not on the left. There is thus a little evidence of fanning out, which is the sort of thing a transformation of the response variable can help with. The flip side of that is that most of the residuals close to zero are on the left rather than the right, although admittedly there are more points on the left as well.
If you think about prices, this actually makes sense: higher prices are likely to be more variable, because a small change in a larger price is likely to be a larger number of dollars. Put another way, a change in something like the selling price of a house is likely to be expressed as a percentage, and a change of say 10% is a larger number of dollars if the price is larger to begin with.
If a percentage change is meaningful here, the right transformation is to take logs. We’ll see what gets suggested shortly.
Run Box-Cox. What transformation of price is suggested, if any?
Solution
I suggest you copy-paste your lm from above, and change the lm to boxcox (you don’t need to change anything else). Most of what follows you can do by copying, pasting and editing:
boxcox(price ~ area + bath + bed + year_built, data = duke_forest)
The peak of the graph is close to 0.5, and the confidence interval goes from about 0.2 to about 0.8. We are looking for a whole number or a whole number plus a half for our transformation, and the only one that fits the bill is 0.5. Power 0.5 is square root, so that is the transformation we will use.
Note that the graph rules out power 1 (“do nothing”, since raising to a power 1 doesn’t change anything), and also “power 0” (take logs), because these are both outside the interval.
Rerun your regression with a suitably transformed response variable, and display the results.
Solution
This, or use (at least) tidy from broom again. Copy, paste, and insert the sqrt, changing the model number:
price.2 <- lm(sqrt(price) ~ area + bath + bed + year_built, data = duke_forest)
summary(price.2)
Call:
lm(formula = sqrt(price) ~ area + bath + bed + year_built, data = duke_forest)
Residuals:
Min 1Q Median 3Q Max
-268.536 -45.398 2.849 56.032 215.263
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -178.16596 1055.18390 -0.169 0.866287
area 0.08654 0.01425 6.075 2.77e-08 ***
bath 52.55888 15.39629 3.414 0.000955 ***
bed -17.44161 16.56864 -1.053 0.295241
year_built 0.29488 0.54097 0.545 0.587010
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 92.08 on 92 degrees of freedom
Multiple R-squared: 0.6056, Adjusted R-squared: 0.5884
F-statistic: 35.32 on 4 and 92 DF, p-value: < 2.2e-16
Confirm that the plot of residuals against fitted values now looks better.
Solution
Copy-paste the previous one and change the previous model to the current one:
ggplot(price.2, aes(x = .fitted, y = .resid)) + geom_point()
The most extreme residuals no longer seem to be in any predictable place (sometimes left, sometimes right), and knowing the fitted value doesn’t seem to say anything about what the residual might be, so I say this is better.
Build a better model by removing any explanatory variables that play no role, one at a time.
Solution
There are two ways to go here: copy-paste your previous regression and literally remove what you want to remove,9 or use update, which is a bit more concise, and is what I did. The first \(x\)-variable to come out is the one with the highest P-value, which is year_built (the intercept always stays):
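A sketch of how that might go with update (the model names follow the numbering used below; after year_built comes out, bed is still not significant, so it comes out next):
price.3 <- update(price.2, . ~ . - year_built)
summary(price.3)
price.4 <- update(price.3, . ~ . - bed)
summary(price.4)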
Call:
lm(formula = sqrt(price) ~ area + bath, data = duke_forest)
Residuals:
Min 1Q Median 3Q Max
-265.790 -52.915 0.374 56.264 205.081
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 360.91621 33.96588 10.626 < 2e-16 ***
area 0.08202 0.01364 6.011 3.48e-08 ***
bath 48.71751 13.57120 3.590 0.000528 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 91.8 on 94 degrees of freedom
Multiple R-squared: 0.5995, Adjusted R-squared: 0.5909
F-statistic: 70.34 on 2 and 94 DF, p-value: < 2.2e-16
and now everything is significant, so here we stop.
The reason for doing this is that we are trying to understand what selling price really depends upon, and now we have our answer. Machine-learning people say that the model with all four \(x\)s is “over-fitted”, meaning that results from it wouldn’t generalize to other house prices. Now that we have only significant explanatory variables, we can be pretty confident that these will generalize.
If you want to, make a full set of residual plots for your final model (residuals vs fitted values, normal quantile plot of residuals, residuals vs all the explanatory) and convince yourself that all is now at least reasonably good. (I allow for the possibility that you are now bored with this and would like to move on to something else, but I had already done these, so…)
Solution
Perhaps the best approach here is to start with the augmented data (you’ll see why later):
price.4a <- augment(price.4)
price.4a
You see the original variables, including the square root transformation that we used in our regression, plus the things from the regression. This is useful for plotting residuals against explanatory variables, except that when we do that, we want to use all of our original explanatory variables, in case there were any non-linear relationships that we missed. So we try again:
price.4a <- augment(price.4, data = duke_forest)
price.4a
This now includes bed and year_built, so that we can include them in our plots.
Off we go. Residuals vs. fitted values:
ggplot(price.4a, aes(x = .fitted, y = .resid)) + geom_point()
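Then a normal quantile plot of the residuals (a sketch):
ggplot(price.4a, aes(sample = .resid)) + stat_qq() + stat_qq_line()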
Residuals vs \(x\)s. This is very like what you did before, only now using the residuals rather than the response. Everything we need is in the augmented data, so start with that:
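# a sketch: pivot the four explanatory variables longer, then facet
# (xname and xvalue are again my choices of column name)
price.4a %>%
  pivot_longer(c(area, bath, bed, year_built), names_to = "xname", values_to = "xvalue") %>%
  ggplot(aes(x = xvalue, y = .resid)) + geom_point() +
  facet_wrap(~ xname, scales = "free")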
I think these are all good enough. The residuals vs fitted values looks random; the normal quantile plot has slightly long tails but not really enough to worry about; the residuals vs \(x\)s look random, allowing for the discrete distributions of bath and bed.10
The reason for including the \(x\)s we dropped from the model in the last graph is that any non-linear dependence on these \(x\)s will show up here, and we could go back and add that \(x\) squared if needed. Here, though, there are no patterns at all in the two graphs in the bottom row, so we can be confident that we got it right: removing those explanatory variables was the right thing to do because they seem to have no relationship with the response, once the other two variables area and bath are in the model.
Footnotes
You only need enough decimals to make the comparison clear. Two is enough, but I went with three. The third digit of the first slope is zero, which ought to be shown really, but isn’t.↩︎
Not to be confused with %>%, which is the tidyverse pipe.↩︎
Which we are not going to do. If you do, what is likely to happen is that you will discover something else suspiciously off the trend, remove it, and then discover something else off the trend, and the process never ends. As I said, the wise thing to do is to only remove observations if you can be confident they are actual errors, for a reason you can name.↩︎
One of the reasons that I didn’t have you draw this plot was that the question was long enough already; another is that I didn’t want to make you grapple with this.↩︎
If you didn’t know the other things, the number of bedrooms would undoubtedly have an impact, but it does not, over and above those other things.↩︎
You can see that some of the houses have 2.5 bathrooms, and you might be wondering how that can be. A half-bathroom, according to this, is a bathroom without an actual bath (or shower), typically a toilet and washbasin. You might have one on the main floor of a two-floor house, to save you or your guests having to go upstairs.↩︎