STAD29 Assignment 1
You are expected to complete this assignment on your own: that is, you may discuss general ideas with others, but the writeup of the work must be entirely your own. If your assignment is unreasonably similar to that of another student, you can expect to be asked to explain yourself.
If you run into problems on this assignment, it is up to you to figure out what to do. The only exception is if it is impossible for you to complete this assignment, for example a data file cannot be read. (There is a difference between you not knowing how to do something, which you have to figure out, and something being impossible, which you are allowed to contact me about.)
You must hand in a rendered document that shows your code, the output that the code produces, and your answers to the questions. This should be a file with .html on the end of its name. There is no credit for handing in your unrendered document (ending in .qmd), because the grader cannot then see whether the code in it runs properly. After you have handed in your file, you should be able to see (in Attempts) what file you handed in, and you should make a habit of checking that you did indeed hand in what you intended to, and that it displays as you expect.
Hint: render your document frequently, and solve any problems as they come up, rather than trying to do so at the end (when you may be close to the due date). If your document will not successfully render, it is because of an error in your code that you will have to find and fix. The error message will tell you where the problem is, but it is up to you to sort out what the problem is.
Packages
library(tidyverse)
library(broom)
library(MASS, exclude = "select")
library(marginaleffects)
Houses in Ames, Iowa
Information was collected about all the nearly 3,000 houses that were sold in Ames, Iowa between 2006 and 2010. There were a lot of variables recorded; we focus on these six:
- price: selling price in US dollars
- qual: rating of overall material and finish of the house (1–10 scale)
- cond: overall condition of house (1–10 scale)
- year: year of original construction
- bath: full bathrooms “above grade” (not counting bathrooms in basement)
- beds: bedrooms above grade (not counting bedrooms in basement)
The data are in http://ritsokiguess.site/datafiles/ames.csv.
- (1 point) Read in and display (some of) the data.
As ever:
<- "http://ritsokiguess.site/datafiles/ames.csv"
my_url <- read_csv(my_url) ames
Rows: 2930 Columns: 6
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
dbl (6): price, qual, cond, year, bath, beds
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
ames
There are 2930 houses in total, indeed nearly 3,000.
Extra: the original dataset had 82 (!) variables, many of which were categorical, such as which of Ames’ 28 neighbourhoods each house was in. I simplified things substantially for you.
- (2 points) Fit a regression predicting selling price from all of the other variables, and display the output from the regression.
This is not meant to be difficult. A bit tedious, maybe, because you have to type all the variable names, but I simplified those for you from the original data, as well:
ames.1 <- lm(price ~ qual + cond + year + bath + beds, data = ames)
summary(ames.1)
Call:
lm(formula = price ~ qual + cond + year + bath + beds, data = ames)
Residuals:
Min 1Q Median 3Q Max
-185794 -26809 -3805 19470 386467
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -728548.94 77272.06 -9.428 < 2e-16 ***
qual 37326.06 814.06 45.852 < 2e-16 ***
cond 2378.34 838.79 2.835 0.00461 **
year 315.55 39.47 7.995 1.84e-15 ***
bath 18769.23 2053.70 9.139 < 2e-16 ***
beds 6009.81 1142.15 5.262 1.53e-07 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 45920 on 2924 degrees of freedom
Multiple R-squared: 0.6701, Adjusted R-squared: 0.6696
F-statistic: 1188 on 5 and 2924 DF, p-value: < 2.2e-16
Everything, you will note, is strongly significant, even though the R-squared is not spectacularly high. This might be because there are so many observations. What can happen in that case is that some of the Estimates are very small, though in this case it looks as if all of them are of practical importance and have a noticeable impact on selling price (even the smallest Estimate, for year: some of these houses were a hundred years old).
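For example, taking the year Estimate at face value, two otherwise-identical houses built a century apart would differ in predicted selling price by something like this (just arithmetic on the output above; not something the question asks for):
315.55 * 100  # roughly 31,500 dollars of predicted price difference over 100 years of construction date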
Extra: a quicker way to run the regression is this:
ames.1a <- lm(price ~ ., data = ames)
summary(ames.1a)
Call:
lm(formula = price ~ ., data = ames)
Residuals:
Min 1Q Median 3Q Max
-185794 -26809 -3805 19470 386467
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -728548.94 77272.06 -9.428 < 2e-16 ***
qual 37326.06 814.06 45.852 < 2e-16 ***
cond 2378.34 838.79 2.835 0.00461 **
year 315.55 39.47 7.995 1.84e-15 ***
bath 18769.23 2053.70 9.139 < 2e-16 ***
beds 6009.81 1142.15 5.262 1.53e-07 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 45920 on 2924 degrees of freedom
Multiple R-squared: 0.6701, Adjusted R-squared: 0.6696
F-statistic: 1188 on 5 and 2924 DF, p-value: < 2.2e-16
The . in the model formula means “all the other variables in the dataframe”, which happens to be exactly what we want here.
- (3 points) Plot the residuals against the fitted values, and describe two problems that you see on this plot.
A rather generous one point for the plot:
ggplot(ames.1, aes(x = .fitted, y = .resid)) + geom_point()
The two problems I see here are, a point each:
- a down-and-up curve (follow the middle of the vertical scatter all the way across)
- fanning out: the spread of the residuals gets bigger as the fitted values increase. (The fancy word here is “heteroskedastic”, from the Greek for “unequal scatter”.)
The curve is perhaps a bit hidden by the variable scatter. I am not generally a fan of adding smooth trends to residual plots (because they tend to make you think something is going on when it really isn’t), but we have lots of data, so:
ggplot(ames.1, aes(x = .fitted, y = .resid)) + geom_point() +
geom_smooth()
`geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
which you might say is a clearer way to see the curved trend.
This is exactly the sort of thing (a curve plus increasing scatter) that a transformation of the response variable price is likely to help with. So we investigate this first, without worrying about the residuals vs. \(x\)s from this regression.
- (3 points) Use Box-Cox on this regression. What do you conclude? Explain briefly.
To run Box-Cox, make sure you have MASS loaded (and do so excluding the MASS select that we don’t want to get mixed up with). Copy the model formula and the data = from your regression into boxcox:
boxcox(price ~ qual + cond + year + bath + beds, data = ames)
The confidence interval for \(\lambda\) is very short, because we have so many observations, but it does include \(\lambda = 0\), so we can confidently suggest that a log transformation would be worth trying.
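Extra: if you want the interval itself rather than reading it off the plot, one possible sketch (not needed for the question) is to save what boxcox returns and keep the \(\lambda\) values whose log-likelihood is within \(\chi^2_1(0.95)/2\) of the maximum:
bc <- boxcox(price ~ qual + cond + year + bath + beds, data = ames, plotit = FALSE)
# lambda values whose log-likelihood is close enough to the maximum (95% confidence)
range(bc$x[bc$y > max(bc$y) - qchisq(0.95, 1) / 2])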
- (2 points) Run the regression suggested by your Box-Cox, and display the output.
A log transformation was called for, so replace price by log(price) in your regression:
ames.2 <- lm(log(price) ~ qual + cond + year + bath + beds, data = ames)
summary(ames.2)
Call:
lm(formula = log(price) ~ qual + cond + year + bath + beds, data = ames)
Residuals:
Min 1Q Median 3Q Max
-1.43748 -0.12329 -0.00224 0.12022 0.94319
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 4.1141937 0.3451547 11.92 <2e-16 ***
qual 0.1797784 0.0036362 49.44 <2e-16 ***
cond 0.0462088 0.0037466 12.33 <2e-16 ***
year 0.0031714 0.0001763 17.99 <2e-16 ***
bath 0.0957951 0.0091733 10.44 <2e-16 ***
beds 0.0532463 0.0051017 10.44 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.2051 on 2924 degrees of freedom
Multiple R-squared: 0.7472, Adjusted R-squared: 0.7467
F-statistic: 1728 on 5 and 2924 DF, p-value: < 2.2e-16
Everything is even more strongly significant than before, and R-squared has gone up a bit, which are encouraging signs. You might also note that all the estimates are positive, which says that “more” goes with a higher selling price: more bathrooms and bedrooms, being built more recently, and in better condition overall.
We should check the residuals again, however, before we get carried away.
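Extra: because the response is log(price), each Estimate is (roughly) a proportional effect on selling price. As a sketch of that idea, anti-logging the bath coefficient from the output above:
exp(coef(ames.2)["bath"])  # about 1.10: an extra bathroom goes with roughly 10% higher predicted selling price, other things equal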
- (4 points) For your most recent regression, obtain plots of (i) residuals against fitted values, (ii) normal quantile plot of residuals, (iii) residuals against each of the explanatory variables.
Follow the familiar procedures (from C32):
ggplot(ames.2, aes(x = .fitted, y = .resid)) + geom_point()
and
ggplot(ames.2, aes(sample = .resid)) + stat_qq() + stat_qq_line()
and then draw a breath and
ames.2 %>% augment(ames) %>%
  pivot_longer(qual:beds, names_to = "xname", values_to = "xval") %>%
  ggplot(aes(x = xval, y = .resid)) + geom_point() +
  facet_wrap(~ xname, scales = "free")
If you don’t remember that, churn the plots out one at a time. They should look the same either way.
- (3 points) Apart from two outliers, explain how your residual plots are basically satisfactory.
The two outliers have residuals that are very negative (about \(-1.5\) when none of the other residuals go beyond \(-1\)). They show up at the bottom of all the plots, but things are otherwise pretty good:
- residuals against fitted values: the points form a random cloud (apart from the two outliers bottom left)
- normal quantile plot of residuals: this is a bit long-tailed, but the points are fairly close to the line (apart from the two outliers at the bottom). You may have to do some hand-waving here.
- residuals against \(x\)-variables: the patterns are close to random, given that most of the variables are discrete (have only a small number of possible values). The outliers, again, appear at the bottom of these plots.
Comment on the outliers somewhere once, so that it is clear you recognize where they are, but having done that, you don’t have to comment further on them when you discuss your three plots.
I actually think the plot of residuals vs. fitted values is much better than it was before, and so predicting the log of selling price is doing a much better job than predicting selling price itself.
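Extra: if you are curious which two houses the outliers are (the question does not ask), one sketch, using augment as above, is to pull out the two most negative residuals:
ames.2 %>% augment(ames) %>%
  slice_min(.resid, n = 2) %>%
  select(price, qual, cond, year, bath, beds, .fitted, .resid)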
- (3 points) We want to understand our best model by doing some predictions. Let’s estimate the mean log-selling price for all houses with the following explanatory-variable values:
- A: overall material rating 6, overall condition 6, constructed in 1971, 2 bathrooms, 3 bedrooms
- B: overall material rating 2, overall condition 1, constructed in 1920, 1 bathroom, 2 bedrooms.
Construct a dataframe containing these values, with column names the same as the column names in the original dataframe (hint: tribble), and use it to obtain predictions for the mean log sale price. Display the predictions, along with the lower and upper 95% confidence limits for the mean log selling price.
That’s a lot, but take it one step at a time. The first thing is to make a dataframe out of the values I gave you. For hand-constructing a small dataframe, tibble or tribble are the way to go, and I like the latter better. This is the dataframe I call new:
new <- tribble(
  ~qual, ~cond, ~year, ~bath, ~beds,
  6, 6, 1971, 2, 3,
  2, 1, 1920, 1, 2
)
new
The values I gave you correspond to the columns in the original dataframe in the same order.
Now we use predictions from the marginaleffects package to get the predictions and display the columns we want:
cbind(predictions(model = ames.2, newdata = new)) %>%
select(estimate, conf.low, conf.high)
The estimated selling price for houses in the second group is less, which you might guess because the houses I called B are older, smaller, and in worse shape.
Code:
- don’t forget the cbind around the predictions to get the same thing as I get (otherwise, you get mostly the same things, but they have different names)
- the output has a lot of columns, so do a select to grab only the ones you want (in this case, few enough so that the grader can easily see that you got the right thing).
- (2 points) Work out the means of each of the variables in the original dataframe.
To work out the means of a bunch of variables without naming any of them, across is your friend:
ames %>% summarize(across(everything(), \(x) mean(x)))
Or, you might remember the “lazy” way from back at the beginning of C32:
summary(ames)
price qual cond year
Min. : 12789 Min. : 1.000 Min. :1.000 Min. :1872
1st Qu.:129500 1st Qu.: 5.000 1st Qu.:5.000 1st Qu.:1954
Median :160000 Median : 6.000 Median :5.000 Median :1973
Mean :180796 Mean : 6.095 Mean :5.563 Mean :1971
3rd Qu.:213500 3rd Qu.: 7.000 3rd Qu.:6.000 3rd Qu.:2001
Max. :755000 Max. :10.000 Max. :9.000 Max. :2010
bath beds
Min. :0.000 Min. :0.000
1st Qu.:1.000 1st Qu.:2.000
Median :2.000 Median :3.000
Mean :1.567 Mean :2.854
3rd Qu.:2.000 3rd Qu.:3.000
Max. :4.000 Max. :8.000
Equally good, for the use we are about to put this to.
- (2 points) Which of the confidence intervals in question 8 was shorter? Why does that make sense? Explain briefly.
If you’re as lazy as I am, you can work out the lengths of the CIs by calculating them:
cbind(predictions(model = ames.2, newdata = new)) %>%
select(estimate, conf.low, conf.high) %>%
mutate(conf.length = conf.high - conf.low)
Or you can just eyeball the confidence limits and say “about 0.02” for the first and “about 0.09” for the second one. Just as good.
The first one (1971 houses) is a lot shorter than the second one (old houses).
To figure out why, let’s take a look at the values we were predicting for, and the means of these variables:
new
ames %>% summarize(across(everything(), \(x) mean(x)))
The first row, with the shorter confidence interval, is very close to the mean on all the variables. The second row, with the longer confidence interval, is further away from the mean, sometimes much further away, on each of the variables. This is the reason its confidence interval is longer: the most accurate predictions are when the explanatory variables are closest to the mean.
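Extra: you can see the “closest to the mean” idea directly. In the marginaleffects package, datagrid with no variables specified holds each numeric explanatory variable at its mean, so a prediction there (a sketch along the same lines as the predictions above) should have about the shortest confidence interval obtainable:
cbind(predictions(model = ames.2, newdata = datagrid(model = ames.2))) %>%
  select(estimate, conf.low, conf.high) %>%
  mutate(conf.length = conf.high - conf.low)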
Extra: I actually chose these values to predict for on purpose, one set close to the means, another set further away.
These are actually predicted log-selling prices, because that’s what our regression was predicting. Our intuition for logs of things, though, is probably not very good. What can we learn about predicted actual selling prices?
The thing to do is to anti-log, that is exp, both the predictions and the confidence limits:
cbind(predictions(model = ames.2, newdata = new)) %>%
select(estimate, conf.low, conf.high) %>%
mutate(across(everything(), \(x) exp(x))) %>%
mutate(conf_length = conf.high - conf.low) %>%
mutate(rel_length = conf_length / estimate * 100)
The lengths of the confidence intervals are now more nearly equal, about $4000, or, you might say, to within \(\pm 2000\). But to within that accuracy is not very precise when the selling price is around $50,000; it is, relatively speaking, much more precise when the selling price is around $175,000. The idea is that taking logs focuses on percent change; the percent error in the first confidence interval is much smaller than in the second one.
The last column rel_length is the length of the confidence interval expressed as a percentage of the prediction, which gets at the idea of how the second interval is relatively much longer than the first one. Also note the similarity of these last numbers to the lengths of the confidence intervals for log price.
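Extra: that similarity is no accident. A quick sketch of why: if the interval for log price is \((l, u)\), with length \(d = u - l\), and the estimate \(m\) sits roughly in the middle, then after anti-logging the relative length is \((e^u - e^l)/e^m = e^{u - m} - e^{l - m} \approx (u - m) - (l - m) = d\), using \(e^x \approx 1 + x\) for small \(x\). So rel_length, being a percentage, is about \(100 d\), which is why those numbers echo the lengths of the log-price intervals.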