STAD29 Assignment 1
You are expected to complete this assignment on your own: that is, you may discuss general ideas with others, but the writeup of the work must be entirely your own. If your assignment is unreasonably similar to that of another student, you can expect to be asked to explain yourself.
If you run into problems on this assignment, it is up to you to figure out what to do. The only exception is if it is impossible for you to complete this assignment, for example a data file cannot be read. (There is a difference between you not knowing how to do something, which you have to figure out, and something being impossible, which you are allowed to contact me about.)
You must hand in a rendered document that shows your code, the output that the code produces, and your answers to the questions. This should be a file with .html on the end of its name. There is no credit for handing in your unrendered document (ending in .qmd), because the grader cannot then see whether the code in it runs properly. After you have handed in your file, you should be able to see (in Attempts) what file you handed in, and you should make a habit of checking that you did indeed hand in what you intended to, and that it displays as you expect.
Hint: render your document frequently, and solve any problems as they come up, rather than trying to do so at the end (when you may be close to the due date). If your document will not successfully render, it is because of an error in your code that you will have to find and fix. The error message will tell you where the problem is, but it is up to you to sort out what the problem is.
Packages
library(tidyverse)
library(broom)
library(MASS, exclude = "select")
library(marginaleffects)
Houses in Ames, Iowa
Information was collected about all the nearly 3,000 houses that were sold in Ames, Iowa between 2006 and 2010. There were a lot of variables recorded; we focus on these six:
- price: selling price in US dollars
- qual: rating of overall material and finish of the house (1–10 scale)
- cond: overall condition of house (1–10 scale)
- year: year of original construction
- bath: full bathrooms “above grade” (not counting bathrooms in basement)
- beds: bedrooms above grade (not counting bedrooms in basement)
The data are in http://ritsokiguess.site/datafiles/ames.csv.
- (1 point) Read in and display (some of) the data.
As ever:
<- "http://ritsokiguess.site/datafiles/ames.csv"
my_url <- read_csv(my_url) ames
Rows: 2930 Columns: 6
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
dbl (6): price, qual, cond, year, bath, beds
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
ames
There are 2930 houses in total, indeed nearly 3,000.
Extra: the original dataset had 82 (!) variables, many of which were categorical, such as which of Ames’ 28 neighbourhoods each house was in. I simplified things substantially for you.
- (2 points) Fit a regression predicting selling price from all of the other variables, and display the output from the regression.
This is not meant to be difficult. A bit tedious, maybe, because you have to type all the variable names, but I simplified those for you from the original data, as well:
ames.1 <- lm(price ~ qual + cond + year + bath + beds, data = ames)
summary(ames.1)
Call:
lm(formula = price ~ qual + cond + year + bath + beds, data = ames)
Residuals:
Min 1Q Median 3Q Max
-185794 -26809 -3805 19470 386467
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -728548.94 77272.06 -9.428 < 2e-16 ***
qual 37326.06 814.06 45.852 < 2e-16 ***
cond 2378.34 838.79 2.835 0.00461 **
year 315.55 39.47 7.995 1.84e-15 ***
bath 18769.23 2053.70 9.139 < 2e-16 ***
beds 6009.81 1142.15 5.262 1.53e-07 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 45920 on 2924 degrees of freedom
Multiple R-squared: 0.6701, Adjusted R-squared: 0.6696
F-statistic: 1188 on 5 and 2924 DF, p-value: < 2.2e-16
Everything, you will note, is strongly significant, even though the R-squared is not spectacularly high. This might be because there are so many observations. What can happen in that case is that some of the Estimates are very small, though in this case it looks as if all of them are of practical importance and have a noticeable impact on selling price (even the smallest Estimate, for year: some of these houses were a hundred years old).
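For example, taking the year Estimate at face value, two otherwise-identical houses built a century apart would differ in predicted selling price by something like this (just arithmetic on the output above; not something the question asks for):
315.55 * 100  # roughly 31,500 dollars of predicted price difference over 100 years of construction date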
Extra: a quicker way to run the regression is this:
ames.1a <- lm(price ~ ., data = ames)
summary(ames.1a)
Call:
lm(formula = price ~ ., data = ames)
Residuals:
Min 1Q Median 3Q Max
-185794 -26809 -3805 19470 386467
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -728548.94 77272.06 -9.428 < 2e-16 ***
qual 37326.06 814.06 45.852 < 2e-16 ***
cond 2378.34 838.79 2.835 0.00461 **
year 315.55 39.47 7.995 1.84e-15 ***
bath 18769.23 2053.70 9.139 < 2e-16 ***
beds 6009.81 1142.15 5.262 1.53e-07 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 45920 on 2924 degrees of freedom
Multiple R-squared: 0.6701, Adjusted R-squared: 0.6696
F-statistic: 1188 on 5 and 2924 DF, p-value: < 2.2e-16
The . in the model formula means “all the other variables in the dataframe”, which happens to be exactly what we want here.
- (3 points) Plot the residuals against the fitted values, and describe two problems that you see on this plot.
A rather generous one point for the plot:
ggplot(ames.1, aes(x = .fitted, y = .resid)) + geom_point()
The two problems I see here are, a point each:
- a down-and-up curve (follow the middle of the vertical scatter all the way across)
- fanning out: the spread of the residuals gets bigger as the fitted values increase. (The fancy word here is “heteroskedastic”, from the Greek for “unequal scatter”.)
The curve is perhaps a bit hidden by the variable scatter. I am not generally a fan of adding smooth trends to residual plots (because they tend to make you think something is going on when it really isn’t), but we have lots of data, so:
ggplot(ames.1, aes(x = .fitted, y = .resid)) + geom_point() +
geom_smooth()
`geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
which you might say is a clearer way to see the curved trend.
This is exactly the sort of thing (a curve plus increasing scatter) that a transformation of the response variable price is likely to help with. So we investigate this first, without worrying about the residuals vs. \(x\)s from this regression.
- (3 points) Use Box-Cox on this regression. What do you conclude? Explain briefly.
To run Box-Cox, make sure you have MASS loaded (and do so excluding the MASS select that we don’t want to get mixed up with). Copy the model formula and the data = from your regression into boxcox:
boxcox(price ~ qual + cond + year + bath + beds, data = ames)
The confidence interval for \(\lambda\) is very short, because we have so many observations, but it does include \(\lambda = 0\), so we can confidently suggest that a log transformation would be worth trying.
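Extra: if you want the interval itself rather than reading it off the plot, one possible sketch (not needed for the question) is to save what boxcox returns and keep the \(\lambda\) values whose log-likelihood is within \(\chi^2_1(0.95)/2\) of the maximum:
bc <- boxcox(price ~ qual + cond + year + bath + beds, data = ames, plotit = FALSE)
# lambda values whose log-likelihood is close enough to the maximum (95% confidence)
range(bc$x[bc$y > max(bc$y) - qchisq(0.95, 1) / 2])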
- (2 points) Run the regression suggested by your Box-Cox, and display the output.
A log transformation was called for, so replace price by log(price) in your regression:
ames.2 <- lm(log(price) ~ qual + cond + year + bath + beds, data = ames)
summary(ames.2)
Call:
lm(formula = log(price) ~ qual + cond + year + bath + beds, data = ames)
Residuals:
Min 1Q Median 3Q Max
-1.43748 -0.12329 -0.00224 0.12022 0.94319
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 4.1141937 0.3451547 11.92 <2e-16 ***
qual 0.1797784 0.0036362 49.44 <2e-16 ***
cond 0.0462088 0.0037466 12.33 <2e-16 ***
year 0.0031714 0.0001763 17.99 <2e-16 ***
bath 0.0957951 0.0091733 10.44 <2e-16 ***
beds 0.0532463 0.0051017 10.44 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.2051 on 2924 degrees of freedom
Multiple R-squared: 0.7472, Adjusted R-squared: 0.7467
F-statistic: 1728 on 5 and 2924 DF, p-value: < 2.2e-16
Everything is even more strongly significant than before, and R-squared has gone up a bit, which are encouraging signs. You might also note that all the estimates are positive, which says that “more” goes with a higher selling price: more bathrooms and bedrooms, being built more recently, and in better condition overall.
We should check the residuals again, however, before we get carried away.
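Extra: because the response is log(price), each Estimate is (roughly) a proportional effect on selling price. As a sketch of that idea, anti-logging the bath coefficient from the output above:
exp(coef(ames.2)["bath"])  # about 1.10: an extra bathroom goes with roughly 10% higher predicted selling price, other things equal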
- (4 points) For your most recent regression, obtain plots of (i) residuals against fitted values, (ii) normal quantile plot of residuals, (iii) residuals against each of the explanatory variables.
Follow the familiar procedures (from C32):
ggplot(ames.2, aes(x = .fitted, y = .resid)) + geom_point()
and
ggplot(ames.2, aes(sample = .resid)) + stat_qq() + stat_qq_line()
and then draw a breath and
ames.2 %>% augment(ames) %>%
  pivot_longer(qual:beds, names_to = "xname", values_to = "xval") %>%
  ggplot(aes(x = xval, y = .resid)) + geom_point() +
  facet_wrap(~ xname, scales = "free")
If you don’t remember that, churn the plots out one at a time. They should look the same either way.
- (3 points) Apart from two outliers, explain how your residual plots are basically satisfactory.
The two outliers have residuals that are very negative (about \(-1.5\) when none of the other residuals go beyond \(-1\)). They show up at the bottom of all the plots, but things are otherwise pretty good:
- residuals against fitted values: the points form a random cloud (apart from the two outliers bottom left)
- normal quantile plot of residuals: this is a bit long-tailed, but the points are fairly close to the line (apart from the two outliers at the bottom). You may have to do some hand-waving here.
- residuals against \(x\)-variables: the patterns are close to random, given that most of the variables are discrete (have only a small number of possible values). The outliers, again, appear at the bottom of these plots.
Comment on the outliers somewhere once, so that it is clear you recognize where they are, but having done that, you don’t have to comment further on them when you discuss your three plots.
I actually think the plot of residuals vs. fitted values is much better than it was before, and so predicting the log of selling price is doing a much better job than predicting selling price itself.
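Extra: if you are curious which two houses the outliers are (the question does not ask), one sketch, using augment as above, is to pull out the two most negative residuals:
ames.2 %>% augment(ames) %>%
  slice_min(.resid, n = 2) %>%
  select(price, qual, cond, year, bath, beds, .fitted, .resid)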
- (3 points) We want to understand our best model by doing some predictions. Let’s estimate the mean log-selling price for all houses with the following explanatory-variable values:
- A: overall material rating 6, overall condition 6, constructed in 1971, 2 bathrooms, 3 bedrooms
- B: overall material rating 2, overall condition 1, constructed in 1920, 1 bathroom, 2 bedrooms.
Construct a dataframe containing these values, with column names the same as the column names in the original dataframe (hint: tribble), and use it to obtain predictions for the mean log sale price. Display the predictions, along with the lower and upper 95% confidence limits for the mean log selling price.
That’s a lot, but take it one step at a time. The first thing is to make a dataframe out of the values I gave you. For hand-constructing a small dataframe, tibble or tribble are the way to go, and I like the latter better. This is the dataframe I call new:
new <- tribble(
  ~qual, ~cond, ~year, ~bath, ~beds,
  6, 6, 1971, 2, 3,
  2, 1, 1920, 1, 2
)
new
The values I gave you correspond to the columns in the original dataframe in the same order.
Now we use predictions from the marginaleffects package to get the predictions and display the columns we want:
cbind(predictions(model = ames.2, newdata = new)) %>%
select(estimate, conf.low, conf.high)
The estimated selling price for houses in the second group is less, which you might guess because the houses I called B are older, smaller, and in worse shape.
Code:
- don’t forget the cbind around the predictions to get the same thing as I get (otherwise, you get mostly the same things, but they have different names)
- the output has a lot of columns, so do a select to grab only the ones you want (in this case, few enough so that the grader can easily see that you got the right thing).
- (2 points) Work out the means of each of the variables in the original dataframe.
To work out the means of a bunch of variables without naming any of them, across is your friend:
ames %>% summarize(across(everything(), \(x) mean(x)))
Or, you might remember the “lazy” way from back at the beginning of C32:
summary(ames)
price qual cond year
Min. : 12789 Min. : 1.000 Min. :1.000 Min. :1872
1st Qu.:129500 1st Qu.: 5.000 1st Qu.:5.000 1st Qu.:1954
Median :160000 Median : 6.000 Median :5.000 Median :1973
Mean :180796 Mean : 6.095 Mean :5.563 Mean :1971
3rd Qu.:213500 3rd Qu.: 7.000 3rd Qu.:6.000 3rd Qu.:2001
Max. :755000 Max. :10.000 Max. :9.000 Max. :2010
bath beds
Min. :0.000 Min. :0.000
1st Qu.:1.000 1st Qu.:2.000
Median :2.000 Median :3.000
Mean :1.567 Mean :2.854
3rd Qu.:2.000 3rd Qu.:3.000
Max. :4.000 Max. :8.000
Equally good, for the use we are about to put this to.
- (2 points) Which of the confidence intervals in question 8 was shorter? Why does that make sense? Explain briefly.
If you’re as lazy as I am, you can work out the lengths of the CIs by calculating them:
cbind(predictions(model = ames.2, newdata = new)) %>%
select(estimate, conf.low, conf.high) %>%
mutate(conf.length = conf.high - conf.low)
Or you can just eyeball the confidence limits and say “about 0.02” for the first and “about 0.09” for the second one. Just as good.
The first one (1971 houses) is a lot shorter than the second one (old houses).
To figure out why, let’s take a look at the values we were predicting for, and the means of these variables:
new
ames %>% summarize(across(everything(), \(x) mean(x)))
The first row, with the shorter confidence interval, is very close to the mean on all the variables. The second row, with the longer confidence interval, is further away from the mean, sometimes much further away, on each of the variables. This is the reason its confidence interval is longer: the most accurate predictions are when the explanatory variables are closest to the mean.
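Extra: you can see the “closest to the mean” idea directly. In the marginaleffects package, datagrid with no variables specified holds each numeric explanatory variable at its mean, so a prediction there (a sketch along the same lines as the predictions above) should have about the shortest confidence interval obtainable:
cbind(predictions(model = ames.2, newdata = datagrid(model = ames.2))) %>%
  select(estimate, conf.low, conf.high) %>%
  mutate(conf.length = conf.high - conf.low)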
Extra: I actually chose these values to predict for on purpose, one set close to the means, another set further away.
These are actually predicted log-selling prices, because that’s what our regression was predicting. Our intuition for logs of things, though, is probably not very good. What can we learn about predicted actual selling prices?
The thing to do is to anti-log, that is exp, both the predictions and the confidence limits:
cbind(predictions(model = ames.2, newdata = new)) %>%
select(estimate, conf.low, conf.high) %>%
mutate(across(everything(), \(x) exp(x))) %>%
mutate(conf_length = conf.high - conf.low) %>%
mutate(rel_length = conf_length / estimate * 100)
The lengths of the confidence intervals are now more nearly equal, about $4000, or, you might say, to within \(\pm 2000\). But to within that accuracy is not very precise when the selling price is around $50,000; it is, relatively speaking, much more precise when the selling price is around $175,000. The idea is that taking logs focuses on percent change; the percent error in the first confidence interval is much smaller than in the second one.
The last column rel_length is the length of the confidence interval expressed as a percentage of the prediction, which gets at the idea of how the second interval is relatively much longer than the first one. Also note the similarity of these last numbers to the lengths of the confidence intervals for log price.
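Extra: that similarity is no accident. A quick sketch of why: if the interval for log price is \((l, u)\), with length \(d = u - l\), and the estimate \(m\) sits roughly in the middle, then after anti-logging the relative length is \((e^u - e^l)/e^m = e^{u - m} - e^{l - m} \approx (u - m) - (l - m) = d\), using \(e^x \approx 1 + x\) for small \(x\). So rel_length, being a percentage, is about \(100 d\), which is why those numbers echo the lengths of the log-price intervals.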