Worksheet 9

Published

March 22, 2024

The questions are below. My solutions follow all the parts of each question; scroll down if you get stuck. Some questions have extra discussion after the solutions; you might find that interesting to read, maybe after tutorial.

For these worksheets, you will learn the most by spending a few minutes thinking about how you would answer each question before you look at my solution. There are no grades attached to these worksheets, so feel free to guess: it makes no difference at all how wrong your initial guess is!

1 Open Space

An organization called OASIS has tabulated the amount of parkland or open space within city limits for each of a number of US cities. Our aim is to predict the amount of open space from the population. The population of each city is given in thousands, and the amount of open space is in acres.1

The data are in http://ritsokiguess.site/datafiles/open-space.csv.

  1. Read in and display (some of) the data.

  2. Make a suitable plot of these data, bearing in mind the aims of the analysis. Add a regression line to your plot.

  3. Fit a regression predicting open space from population, and display the results.

  4. Is there a significant effect of population on the amount of open space? Explain briefly. What does this mean, in the context of the data?

  5. When some of the values of a variable are much larger than others, a common strategy is to take logs of all such variables (both the quantitative variables, here). Find a way to re-draw your scatterplot on log-scales for both variables. Cite your source, or describe your thought process, as appropriate. (In R, log means natural (base \(e\)) log, which is fine here.)

  6. Run a regression predicting log-open space from log-population. Display the results.

  7. Using your regression of the previous part, predict the amount of open space in a city that has a population of 500 (thousand). Show your process.

Open Space: my solutions

An organization called OASIS has tabulated the amount of parkland or open space within city limits for each of a number of US cities. Our aim is to predict the amount of open space from the population. The population of each city is given in thousands, and the amount of open space is in acres.2

The data are in http://ritsokiguess.site/datafiles/open-space.csv.

  1. Read in and display (some of) the data.

Solution

my_url <- "http://ritsokiguess.site/datafiles/open-space.csv"
space <- read_csv(my_url)
Rows: 12 Columns: 3
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (1): city
dbl (2): population, open_space

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
space

Extra: Historically, an acre was one furlong by one chain; a furlong is an eighth of a mile or 220 yards, and a chain is 22 yards or 1/80 mile. In the UK, these units are still used in certain contexts. A horse race might be 1 mile and 2 furlongs (1.25 miles), and miles and chains are still used to measure distances on the railway system; each point on the railway, such as a station, is a certain number of miles and chains from a zero point. If you are a cricket fan, you will note that 1 chain is also the length of a cricket pitch. By coincidence, a furlong is very close to 200 metres in length, and thus an acre is very close to \(200 \times 20 = 4000\) square metres.

\(\blacksquare\)

  2. Make a suitable plot of these data, bearing in mind the aims of the analysis. Add a regression line to your plot.

Solution

Two quantitative variables, with the amount of open space being the response, so a scatterplot is called for. (The city is an identifier variable, so doesn’t belong in the plot.3) This being a scatterplot, it makes sense at least to try adding a regression line to the plot.4

ggplot(space, aes(x = population, y = open_space)) + geom_point() + 
  geom_smooth(method = "lm", se = FALSE)
`geom_smooth()` using formula = 'y ~ x'

Extra 1: A straight line seems to fit reasonably well. However, some of the cities are much bigger than the others.5

Extra 2: if you want to include the names of the cities on the plot, the way to do it (since city is an identifier variable) is to use the names to label the points they belong to, using geom_text_repel from package ggrepel (loaded with library(ggrepel)):

ggplot(space, aes(x = population, y = open_space, label = city)) + geom_point() + 
  geom_smooth(method = "lm", se = FALSE) +
  geom_text_repel()
`geom_smooth()` using formula = 'y ~ x'
Warning: The following aesthetics were dropped during statistical transformation: label
ℹ This can happen when ggplot fails to infer the correct grouping structure in
  the data.
ℹ Did you forget to specify a `group` aesthetic or to convert a numerical
  variable into a factor?
Warning: ggrepel: 1 unlabeled data points (too many overlaps). Consider
increasing max.overlaps

The warning message is that the points at the bottom left are too close together, and so there isn’t room to draw all their labels. This could be solved by making the text smaller, or in another way that we’ll see later.
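
For example, you could make the labels smaller and tell ggrepel to label every point regardless of overlap. A minimal sketch (the size value is my choice):

ggplot(space, aes(x = population, y = open_space, label = city)) + geom_point() + 
  geom_smooth(method = "lm", se = FALSE) +
  geom_text_repel(size = 2.5, max.overlaps = Inf)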

\(\blacksquare\)

  3. Fit a regression predicting open space from population, and display the results.

Solution

As expected:

space.1 <- lm(open_space ~ population, data = space)
summary(space.1)

Call:
lm(formula = open_space ~ population, data = space)

Residuals:
    Min      1Q  Median      3Q     Max 
-7283.3  -507.7   -27.4   654.0  5994.9 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 1248.1975  1198.5499   1.041    0.322    
population     6.1050     0.4357  14.013 6.71e-08 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3273 on 10 degrees of freedom
Multiple R-squared:  0.9515,    Adjusted R-squared:  0.9467 
F-statistic: 196.4 on 1 and 10 DF,  p-value: 6.711e-08

\(\blacksquare\)

  4. Is there a significant effect of population on the amount of open space? Explain briefly. What does this mean, in the context of the data?

Solution

This is asking about statistical significance, so you need to look at the P-value attached to the slope, which is \(6.7 \times 10^{-8}\), very small indeed. We can definitely reject the null hypothesis that the slope is zero. The estimate is 6.1; this means that a city that has a population 1 thousand bigger is expected to have 6.1 acres of additional open space.

This is, in the grand scheme of things, not very surprising. A city with a greater population will tend to take up more space, some of which will be parkland.

\(\blacksquare\)

  5. When some of the values of a variable are much larger than others, a common strategy is to take logs of all such variables (both the quantitative variables, here). Find a way to re-draw your scatterplot on log-scales for both variables. Cite your source, or describe your thought process, as appropriate. (In R, log means natural (base \(e\)) log, which is fine here.)

Solution

There are a couple of approaches:

  • redraw the plot with both axes on a log scale, without changing anything else (you will probably have to find out how to do that)
  • plot the logs of population and open space
  • make new columns containing those logs and then plot them

I think the first way is nicest, if you can find out how to do it. I found the answer here, actually two answers. The easier is this:

ggplot(space, aes(x = population, y = open_space)) + geom_point() + 
  geom_smooth(method = "lm", se = FALSE) +
  scale_x_log10() + scale_y_log10()
`geom_smooth()` using formula = 'y ~ x'

These are log scales (don’t forget to apply them to both variables), but the numbers that appear on the axes are on the original scale of the data. Note that equal space on each axis corresponds to an equal multiplicative change, so that the same distance on the axis corresponds to a bigger change between numbers as you go further along the axis.

A second way is to directly plot \(\log y\) against \(\log x\):

ggplot(space, aes(x = log(population), y = log(open_space))) + geom_point() + 
  geom_smooth(method = "lm", se = FALSE)
`geom_smooth()` using formula = 'y ~ x'

The plot looks the same, but the numbers on the axes are (natural) logs of the data values. These are equally spaced, but not so easy to interpret.

A third way is to work out the logs, put them in new columns, and plot those:

space %>% 
  mutate(log_population = log(population), 
         log_open_space = log(open_space)) -> space_log
  ggplot(space_log, aes(x = log_population, y = log_open_space)) + 
    geom_point() + 
    geom_smooth(method = "lm", se = FALSE)
`geom_smooth()` using formula = 'y ~ x'

I would rank these three possibilities in descending order of desirability:

  • the first one has actual populations and amounts of open space on the axes, so it is the easiest to interpret.
  • the second one is easier coding than the third, because the log calculation happens as the graph is drawn (though you might argue that once the new columns are made, the plotting itself is easier).

Extra: you can label the cities on this one as well:

ggplot(space, aes(x = population, y = open_space, label = city)) + geom_point() + 
  geom_smooth(method = "lm", se = FALSE) +
  scale_x_log10() + scale_y_log10() +
  geom_text_repel()
`geom_smooth()` using formula = 'y ~ x'
Warning: The following aesthetics were dropped during statistical transformation: label
ℹ This can happen when ggplot fails to infer the correct grouping structure in
  the data.
ℹ Did you forget to specify a `group` aesthetic or to convert a numerical
  variable into a factor?

and there is now space to label them all, because the ones on the left are not so bunched up. You see that Miami is a small city relative to the others here, but even given its size, it has a very small amount of open space.

I don’t know why the other warning is still there. The label is used for labelling the cities, and that has worked. My searching suggests that this warning is new, and sometimes it gets shown when it doesn’t need to be (the ggplot people are still working out the kinks).

\(\blacksquare\)

  6. Run a regression predicting log-open space from log-population. Display the results.

Solution

Insert the logs into the regression directly:

space.2 <- lm(log(open_space) ~ log(population), data = space)
summary(space.2)

Call:
lm(formula = log(open_space) ~ log(population), data = space)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.82127 -0.06664  0.02542  0.14978  0.58396 

Coefficients:
                Estimate Std. Error t value Pr(>|t|)    
(Intercept)       2.8151     0.8294   3.394  0.00684 ** 
log(population)   0.8823     0.1197   7.372 2.39e-05 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.4067 on 10 degrees of freedom
Multiple R-squared:  0.8446,    Adjusted R-squared:  0.8291 
F-statistic: 54.35 on 1 and 10 DF,  p-value: 2.391e-05

Alternatively, if you saved the dataframe with the logged open space and population in it, use that instead to get to the same place:

space.2a <- lm(log_open_space ~ log_population, data = space_log)
summary(space.2a)

Call:
lm(formula = log_open_space ~ log_population, data = space_log)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.82127 -0.06664  0.02542  0.14978  0.58396 

Coefficients:
               Estimate Std. Error t value Pr(>|t|)    
(Intercept)      2.8151     0.8294   3.394  0.00684 ** 
log_population   0.8823     0.1197   7.372 2.39e-05 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.4067 on 10 degrees of freedom
Multiple R-squared:  0.8446,    Adjusted R-squared:  0.8291 
F-statistic: 54.35 on 1 and 10 DF,  p-value: 2.391e-05

I am equally happy with either of these.

\(\blacksquare\)

  7. Using your regression of the previous part, predict the amount of open space in a city that has a population of 500 (thousand). Show your process.

Solution

This means going into and out of logs; the inverse of (natural) logs is exp.

The obvious way is to use R as a calculator:

log_pop <- log(500)
pred_log_space <- 2.8151 + 0.8823 * log_pop
pred_space <- exp(pred_log_space)
pred_space
[1] 4016.832

If you look back at the data, this seems to be roughly consistent with cities with about that population.

A less obvious but easier way involves using predict. I used this when I compared the three windmill models for wind velocities outside the range of the data. But the idea can be used for any predictions at all. We just have to remember here to take logs and anti-logs in the right places:

new <- tibble(population = 500)
p <- predict(space.2, new)
p
       1 
8.298413 

That’s the (natural) log of the amount of open space. (The prediction took the log of 500 for us since the regression had log of population in it.) So now we need to anti-log this:

exp(p)
       1 
4017.493 

This is a more accurate version of what we got before, since it uses the coefficients to full precision, not just the four or five significant digits that appear in the summary table. You'll need an answer that rounds to 4017, whichever way you did it.
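
If you like, the prediction and the anti-log can be chained into one line, a sketch that gives the same answer as above:

exp(predict(space.2, tibble(population = 500)))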

\(\blacksquare\)

2 Heights of parents and children

In 1886, Francis Galton recorded the heights of 928 (adult) children and their parents’ mid-height (that is, the average of the father’s and mother’s heights). Each height was measured in inches and classified into intervals of width (mostly) 1 inch. The data are in http://ritsokiguess.site/datafiles/Galton.csv.

  1. Read in and display (some of) the data.

  2. Make a suitable plot of your dataframe.

  3. There are 928 individuals, but apparently fewer points than that on your graph. What happened to the others?

  4. Re-draw your graph, but this time using geom_jitter to plot the points. Add a smooth trend. What seems to have happened, and what does the smooth trend show?

  5. Obtain the regression line for predicting child’s height from parent’s mid-height. (That is, fit the regression model and display the output.)

  6. The paper that Galton wrote about these data was called “Regression Towards Mediocrity in Hereditary Stature”. This seems like an obvious title, since we just did a regression analysis, but it was actually the first ever use of the term “regression” in a statistical context. Find the dictionary definition of “regression” that Galton would have used, citing your source. Explain briefly what Galton meant by his article title. Hint: “towards mediocrity” means “towards the mean”. You might like to do some predictions.

  7. Obtain (i) a plot of residuals against fitted values, (ii) a normal quantile plot of residuals. How do you know that these are both satisfactory?

Heights of parents and children: my solutions

In 1886, Francis Galton recorded the heights of 928 (adult) children and their parents’ mid-height (that is, the average of the father’s and mother’s heights). Each height was measured in inches and classified into intervals of width (mostly) 1 inch. The data are in http://ritsokiguess.site/datafiles/Galton.csv.

  1. Read in and display (some of) the data.

Solution

Exactly as usual:

my_url <- "http://ritsokiguess.site/datafiles/Galton.csv"
Galton <- read_csv(my_url)
Rows: 928 Columns: 2
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
dbl (2): parent, child

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Galton

I used an uppercase G on the name of my dataframe, which I will have to remember to type later. You don’t need to.

There are two quantitative columns, containing (respectively) the parents’ mid-height and the child’s height. There are indeed 928 children.6

The data come from the dataset Galton in the package HistData that contains datasets of historical importance.
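
If you happen to have HistData installed, you can peek at the packaged version of the data directly (a sketch; this assumes the package is installed):

# the same parent/child heights, straight from the package (if installed)
head(HistData::Galton)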

Extra: Galton’s work was revisited in 2004, and some pages from Galton’s notebook were scanned, for example this one, in which you can admire Galton’s handwriting. Something you can note here: if Galton had written every height out in full, he would have written a lot of 6s (a lot of the heights were between 5 and 6 feet, 60 and 72 inches), so he wrote down the deviations from 60 inches, which was a lot less writing.

\(\blacksquare\)

  2. Make a suitable plot of your dataframe.

Solution

Two quantitative variables, so a scatterplot. No problems here:

ggplot(Galton, aes(x = parent, y = child)) + geom_point()

Add a smooth trend if you like, but we have some more work to do first. (I have no objections if you add a smooth trend here.) I think it makes more sense to think of the child’s height as an “outcome” and so that belongs on the \(y\)-axis (but see an Extra below).

\(\blacksquare\)

  3. There are 928 individuals, but apparently fewer points than that on your graph. What happened to the others?

Solution

The clue is that Galton used classes (bins) of mostly 1-inch width, so there would have been many parent-child pairs with exactly the same heights (on both variables): see, for example, rows 6, 7, and 8 of the dataframe you read in and displayed. These observations would have plotted in exactly the same place on the graph, so you wouldn’t have been able to see that there was more than one of them.

Extra: Galton’s solution to this was as shown in page 111 here (use your UTorID and password to access): instead of plotting points, he plotted numbers which were how many observations appeared at that point. We can actually mimic the idea of his plot:

Galton %>% 
  count(parent, child) %>% 
  ggplot(aes(x = child, y = parent, label = n)) + geom_text()

Galton plotted the children’s heights on the \(x\)-axis, and appears to have used only a subset of his data to make the plot there. I suspect that the idea of plotting the response on the \(y\)-axis came historically much later.

The coding strategy here is to not use geom_point, but instead to use geom_text: instead of plotting a black circle, we plot some text, specifically the text passed in as label, here the column n that contains the frequencies of each parent-child height combination.

In the next part, we explore a different solution.

\(\blacksquare\)

  4. Re-draw your graph, but this time using geom_jitter to plot the points. Add a smooth trend. What seems to have happened, and what does the smooth trend show?

Solution

Replace geom_point with geom_jitter and add a smooth trend:

ggplot(Galton, aes(x = parent, y = child)) + geom_jitter() + geom_smooth()
`geom_smooth()` using method = 'loess' and formula = 'y ~ x'

The jittering has spread out the points so that you can see them all (or, at least, get a much better sense of how many of them there are). Specifically, geom_jitter randomly moves the points a maximum of halfway to the next value, so that (for example) the points that originally went with a parent’s mid-height of just over 70 inches are not in danger of being confused with the ones near 69 or 71 inches (and likewise with the children’s heights).
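
If you want to control exactly how far the points get moved, geom_jitter has width and height arguments. A sketch, jittering by up to half an inch in each direction (the amounts are my choice, matching the 1-inch bins):

ggplot(Galton, aes(x = parent, y = child)) + 
  geom_jitter(width = 0.5, height = 0.5) + 
  geom_smooth()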

As for the smooth trend, this shows an upward relationship (taller parents have taller children on average) that is a little curved but close to straight. The correlation appears to be moderate at best. (The grey envelope, if you drew it, is narrow because there are a lot of observations.)

\(\blacksquare\)

  5. Obtain the regression line for predicting child’s height from parent’s mid-height. (That is, fit the regression model and display the output.)

Solution

Galton.1 <- lm(child ~ parent, data = Galton)
summary(Galton.1)

Call:
lm(formula = child ~ parent, data = Galton)

Residuals:
    Min      1Q  Median      3Q     Max 
-7.8050 -1.3661  0.0487  1.6339  5.9264 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 23.94153    2.81088   8.517   <2e-16 ***
parent       0.64629    0.04114  15.711   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.239 on 926 degrees of freedom
Multiple R-squared:  0.2105,    Adjusted R-squared:  0.2096 
F-statistic: 246.8 on 1 and 926 DF,  p-value: < 2.2e-16

I didn’t ask, but: the R-squared is not high, going with a correlation of about 0.4, which seems to fit with the jittered scatterplot, but the slope is strongly significantly different from zero, indicating that taller parents really do have taller children on average.

\(\blacksquare\)

  6. The paper that Galton wrote about these data was called “Regression Towards Mediocrity in Hereditary Stature”. This seems like an obvious title, since we just did a regression analysis, but it was actually the first ever use of the term “regression” in a statistical context.

Find the dictionary definition of “regression” that Galton would have used, citing your source. Explain briefly what Galton meant by his article title. Hint: “towards mediocrity” means “towards the mean”. You might like to do some predictions.

Solution

The easiest way to find a definition for the word is to type “define regression” into Google. Remember, you’re looking for a non-statistical definition, and the best one seems to be “a return to a former or less developed state”. That seems to come from here. Or, you can use dictionary.com, where the first definition is “the act of going back to a previous place or state; return or reversion”. The sense you want is of “going back”: that is, the opposite of “progression”.

So, Galton seems to mean a “going back towards the mean”. To see what he means by that, let’s do some predictions. I think predicting for a high value (of parent’s mid-height) and a low value will show up what’s going on. Let’s try 64 and 72 inches.

Find a way of doing the predictions, at least approximately. One way is to add the regression line to your graph from earlier:

ggplot(Galton, aes(x = parent, y = child)) + geom_jitter() + geom_smooth(method = "lm")
`geom_smooth()` using formula = 'y ~ x'

For parents with mid-height 64 inches, the child is predicted to be just over 65 inches tall. For parents with mid-height 72 inches, the child is predicted to be just over 70 inches tall.

The way you probably thought of is to use your calculator and put the two values into the regression equation. Or you can use R as a calculator:

23.94153 + 0.64629 * 64
[1] 65.30409
23.94153 + 0.64629 * 72
[1] 70.47441

A perhaps better way to predict (though the other ways are equally good for our purposes here, and I probably won’t have gotten to this in lecture in time) is on slide 40 of the windmill case study: set up a little dataframe with the desired parent heights in it:

new <- tribble(
  ~parent,
  64,
  72
)
new

and then feed that into predict with the model name:

predict(Galton.1, new)
       1        2 
65.30413 70.47445 

Finally, explain how these are “regressing towards the mean”: when the parents are short, the child is also predicted to be short, but not as short as their parents: that is, closer to the average height of all children. When the parents are tall, the child is predicted to be tall, but not as tall as their parents: once again, closer to the average height of all children.

Another way to get at this is to look at the value of the slope, which is 0.65, and reason as follows:

The regression line always goes through the “point of averages”: that is, if the parents’ mid-height is at the mean, their child’s height is also predicted to be at the mean. If the parents’ mid-height goes up by 1 inch, the predicted height of their child also goes up, but only by 0.65 inch. This is another way to see that a child of taller-than-average parents is also predicted to be taller than average, but not by as much: the child’s height will tend to be closer to the mean than their parents’ mid-height was.

The same argument is made by going down from the mean: if the parents’ mid-height is 1 inch less than the mean, the child’s height is predicted to be 0.65 inches less than the mean: again, shorter than average, but not by as much.
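
If you want to see this numerically, here is a sketch (the dataframe name new_mean is mine): predict at the mean parent mid-height and at one inch above it. The second prediction should be only about 0.65 inches bigger than the first:

xbar <- mean(Galton$parent)
new_mean <- tibble(parent = c(xbar, xbar + 1))
predict(Galton.1, new_mean)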

Extra: This is exactly the phenomenon of “regression towards the mean”, something you may have heard of before: when the \(x\) is unusual, the \(y\) is also predicted to be unusual, but not as unusual as the \(x\) was. You can see now that use of “regression” in “regression towards the mean” is in the old sense, not the statistical sense.

If you looked up the definition of “regression” on Google, you might also have seen a usage graph over time. “Regression” was barely used before 1900 (that is, in Galton’s time), but has become much more popular since about 1950, mostly in the statistical sense. “Regression” came to be used as the statistical term for the method used to illustrate regression towards the mean. It’s amusing to think that the statistical term originated with the exact data set that we just looked at.

\(\blacksquare\)

  7. Obtain (i) a plot of residuals against fitted values, (ii) a normal quantile plot of residuals. How do you know that these are both satisfactory?

Solution

Use the fitted model object (what I called Galton.1) as if it were a dataframe to make the plots. For the first one, it is better to use geom_jitter again to show all the points:

ggplot(Galton.1, aes(x = .fitted, y = .resid)) + geom_jitter()

This appears to be a random collection of points with no pattern, which is what we want. If you like, add a smooth trend to it:

ggplot(Galton.1, aes(x = .fitted, y = .resid)) + geom_jitter() +
  geom_smooth()
`geom_smooth()` using method = 'loess' and formula = 'y ~ x'

This wiggles a bit (to reflect the wiggle on the scatterplot: the trend is not quite linear), but the grey envelope includes zero almost all the way across, so I call this good.

The normal quantile plot:

ggplot(Galton.1, aes(sample = .resid)) + stat_qq() + stat_qq_line()

The points are all close to the line, indicating a normal distribution for the residuals. If you want to mention the upper tail, you can say that this tail is short, so there will be no problem in the regression. (A long tail, or outliers, could be a problem, but we don’t have that here.)

Extra: the horizontal rows of points on the first two plots are actually points that would plot in exactly the same place were they not jittered: that is, they have the same residual and fitted value. On the normal quantile plot, the “stairsteps” likewise indicate multiple residuals that are equal. Both of these are features of the data, not indicative of problems.

\(\blacksquare\)

3 Houses in Duke Forest, North Carolina

The data in http://ritsokiguess.site/datafiles/duke_forest.csv are of houses that were sold around November 2020 in the Duke Forest area of Durham, North Carolina. For each house, the selling price (in US $), called price, was recorded, along with some other features of the house:

  • bed: the number of bedrooms
  • bath: the number of bathrooms
  • area: the area of the inside of the house, in square feet
  • year_built: the year the house was originally built

Our aim is to predict the selling price of a house from its other features. There are 97 houses in the data set.

Note: this is rather a long question, but I wanted to give you a chance to practice everything.

  1. Read in and display (some of) the data.

  2. Make a graph of selling price against each of the explanatory variables, using one ggplot line.

  3. Comment briefly on your plots.

  4. Fit a regression predicting price from the other variables, and display the results.

  5. What is the meaning of the number in the bath row in the Estimate column?

  6. Plot the residuals from your regression against the fitted values. What evidence is there that a transformation of the selling prices might be a good idea? (Hint: look at the right side of your graph.)

  7. Run Box-Cox. What transformation of price is suggested, if any?

  8. Rerun your regression with a suitably transformed response variable, and display the results.

  9. Confirm that the plot of residuals against fitted values now looks better.

  10. Build a better model by removing any explanatory variables that play no role, one at a time.

  11. If you want to, make a full set of residual plots for your final model (residuals vs fitted values, normal quantile plot of residuals, residuals vs all the explanatory) and convince yourself that all is now at least reasonably good. (I allow for the possibility that you are now bored with this question and would like to move on to something else, but I had already done these, so…)

Houses in Duke Forest, North Carolina: my solutions

The data in http://ritsokiguess.site/datafiles/duke_forest.csv are of houses that were sold around November 2020 in the Duke Forest area of Durham, North Carolina. For each house, the selling price (in US $), called price, was recorded, along with some other features of the house:

  • bed: the number of bedrooms
  • bath: the number of bathrooms
  • area: the area of the inside of the house, in square feet
  • year_built: the year the house was originally built

Our aim is to predict the selling price of a house from its other features. There are 97 houses in the data set.

Note: this is rather a long question, but I wanted to give you a chance to practice everything.

(a) Read in and display (some of) the data.

Solution

Suggestion, before you start: we’ll be using MASS later, so install and load that first, before even tidyverse. Then load broom as well if you plan to use that later.
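
In code, that suggestion amounts to something like this (the comments are my reading of why the order matters):

library(MASS)       # for boxcox(), used later in this question
library(tidyverse)  # loaded after MASS, so dplyr's select() masks MASS's select()
library(broom)      # for tidy() and glance(), if you use them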

The reading is a big surprise, I don’t think:

my_url <- "http://ritsokiguess.site/datafiles/duke_forest.csv"
duke_forest <- read_csv(my_url)
Rows: 97 Columns: 5
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
dbl (5): price, bed, bath, area, year_built

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
duke_forest

There are indeed 97 rows and the variables promised.

(b) Make a graph of selling price against each of the explanatory variables, using one ggplot line.

Solution

For a plot, we want the selling price in one column (it already is) and all the explanatory variable values in another column, with a third column labelling which explanatory variable it is. Then, to make the plot, plot selling price against the \(x\)-column, making facets according to the names of the \(x\)-variables.

This is done by pivoting longer first. I also always forget (until I see the graph) that a scales = "free" is needed:

duke_forest %>% 
  pivot_longer(-price, names_to = "xname", values_to = "xval") %>%
  ggplot(aes(x = xval, y = price)) + geom_point() + 
    facet_wrap(~xname, scales = "free")

If that code is too much for you, run the pivot-longer first and look at the results. Then think about how you would use that dataframe to make scatterplots of selling price against each of those explanatory variables.
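
For reference, the pivot-longer step on its own is just the first two lines of the pipeline above:

duke_forest %>% 
  pivot_longer(-price, names_to = "xname", values_to = "xval")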

The reason for the scales = "free" is that each of the explanatory variables is on its own scale, so you want the graph to have a different scale for each of them. For example, the year is a number like 2000, but the number of bedrooms is a number like 4. You can see, by taking out the scales = "free", that it doesn’t work very well to have all of them on the same scale!

(c) Comment briefly on your plots.

Solution

All of them except for year_built (bottom right) show an upward apparently linear trend, with at least one outlier. I don’t think there is any trend with the age of the house.

This makes sense, because you would expect a house with more bedrooms or bathrooms, and a larger square footage, to be more expensive to buy.

That’s about as much as you need to say. There are not any problems with non-linearity here.

Extra: One of the houses was sold for near $1.4 million, and is at the top of each of these plots. This house has three bedrooms and four bathrooms, which makes you wonder why it sold for so much, until you look at the area plot, where it is at the top right: a huge house that happens to have only three bedrooms and four bathrooms. The hugeness is what made it so expensive.

(d) Fit a regression predicting price from the other variables, and display the results.

Solution

price.1 <- lm(price ~ area + bath + bed + year_built, data = duke_forest)
summary(price.1)

Call:
lm(formula = price ~ area + bath + bed + year_built, data = duke_forest)

Residuals:
    Min      1Q  Median      3Q     Max 
-446286  -76678  -11789   77958  442108 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) -308181.85 1623259.49  -0.190  0.84984    
area            148.05      21.91   6.756 1.27e-09 ***
bath          70752.06   23685.14   2.987  0.00361 ** 
bed          -33713.81   25488.64  -1.323  0.18921    
year_built      189.11     832.21   0.227  0.82074    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 141600 on 92 degrees of freedom
Multiple R-squared:  0.6082,    Adjusted R-squared:  0.5912 
F-statistic: 35.71 on 4 and 92 DF,  p-value: < 2.2e-16

Or, if you prefer,

glance(price.1)
tidy(price.1)

or just the second one of those.

(e) What is the meaning of the number in the bath row in the Estimate column?

Solution

This value 70752 is the slope of bath in this multiple regression. It means that a house with an extra bathroom but the same number of bedrooms, same amount of square footage and built in the same year, is predicted to sell for just over $70,000 more. (The “all else equal” is important.)

This is, you might say, the “value” of an extra bathroom, in that if somebody wanted to remodel a house to add an extra bathroom, which would of course cost something, they could balance the construction cost with this slope to decide whether adding an extra bathroom is worth doing before they sell the house.
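
To see that slope in action, here is a sketch predicting for two hypothetical houses that are identical except that the second one has an extra bathroom (the other values are invented for illustration). The two predictions differ by the bath estimate, about $70,752:

new_houses <- tribble(
  ~area, ~bath, ~bed, ~year_built,
   2000,     2,    3,        1970,
   2000,     3,    3,        1970
)
predict(price.1, new_houses)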

Extra: The coefficient for bed is negative, which appears to say that having another bedroom decreases the selling price, but look at the non-significant P-value: the slope coefficient for bed is actually not significantly different from zero. That is, once you know the other things, having an extra bedroom does not significantly increase the selling price.7 The reason for this is probably that a house with more bathrooms will also tend to have more bedrooms: that is, that bathrooms and bedrooms will be positively correlated. In regression, having correlated explanatory variables can cause some of them to be less significant than you would expect (“multicollinearity”). The idea is that a house with more bathrooms likely has more bedrooms, and you can infer this without actually needing to know how many bedrooms it has.
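
If you want to check the claim that bed and bath really are positively correlated, a quick sketch:

with(duke_forest, cor(bed, bath))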

(f) Plot the residuals from your regression against the fitted values. What evidence is there that a transformation of the selling prices might be a good idea? (Hint: look at the right side of your graph.)

Solution

Use the fitted model object as if it were a dataframe:

ggplot(price.1, aes(x = .fitted, y = .resid)) + geom_point()

This looks pretty close to random, but if you look at the right side of the graph, you’ll see that the most extreme residuals (the two most positive ones and at least the four most negative ones) are all on the right side and not on the left. There is thus a little evidence of fanning out, which is the sort of thing a transformation of the response variable can help with. The flip side of that is that most of the residuals close to zero are on the left rather than the right, although admittedly there are more points on the left as well.

If you think about prices, this actually makes sense: higher prices are likely to be more variable, because a small change in a larger price is likely to be a larger number of dollars. Put another way, a change in something like the selling price of a house is likely to be expressed as a percentage, and a change of say 10% is a larger number of dollars if the price is larger to begin with.

If a percentage change is meaningful here, the right transformation is to take logs. We’ll see what gets suggested shortly.

(g) Run Box-Cox. What transformation of price is suggested, if any?

Solution

I suggest you copy-paste your lm from above, and change the lm to boxcox (you don’t need to change anything else). Most of what follows you can do by copying, pasting and editing:

boxcox(price ~ area + bath + bed + year_built, data = duke_forest)

The peak of the graph is close to 0.5, and the confidence interval goes from about 0.2 to about 0.8. We are looking for a whole number or a whole number plus a half for our transformation, and the only one that fits the bill is 0.5. Power 0.5 is square root, so that is the transformation we will use.

Note that the graph rules out power 1 (“do nothing”, since raising to a power 1 doesn’t change anything), and also “power 0” (take logs), because these are both outside the interval.
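
Extra: if you find the peak and the interval hard to read off the default plot, boxcox has a lambda argument that lets you zoom in on a narrower range of powers (the range here is my choice):

boxcox(price ~ area + bath + bed + year_built, data = duke_forest,
       lambda = seq(0, 1, 0.05))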

(h) Rerun your regression with a suitably transformed response variable, and display the results.

Solution

This, or use (at least) tidy from broom again. Copy, paste, and insert the sqrt, changing the model number:

price.2 <- lm(sqrt(price) ~ area + bath + bed + year_built, data = duke_forest)
summary(price.2)

Call:
lm(formula = sqrt(price) ~ area + bath + bed + year_built, data = duke_forest)

Residuals:
     Min       1Q   Median       3Q      Max 
-268.536  -45.398    2.849   56.032  215.263 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) -178.16596 1055.18390  -0.169 0.866287    
area           0.08654    0.01425   6.075 2.77e-08 ***
bath          52.55888   15.39629   3.414 0.000955 ***
bed          -17.44161   16.56864  -1.053 0.295241    
year_built     0.29488    0.54097   0.545 0.587010    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 92.08 on 92 degrees of freedom
Multiple R-squared:  0.6056,    Adjusted R-squared:  0.5884 
F-statistic: 35.32 on 4 and 92 DF,  p-value: < 2.2e-16

(i) Confirm that the plot of residuals against fitted values now looks better.

Solution

Copy-paste the previous one and change the previous model to the current one:

ggplot(price.2, aes(x = .fitted, y = .resid)) + geom_point()

The most extreme residuals no longer seem to be in any predictable place (sometimes left, sometimes right), and knowing the fitted value doesn’t seem to say anything about what the residual might be, so I say this is better.

(j) Build a better model by removing any explanatory variables that play no role, one at a time.

Solution

There are two ways to go here: copy-paste your previous regression and literally remove what you want to remove, or use update, which is a bit more concise, and is what I did. The first \(x\)-variable to come out is the one with the highest P-value, which is year_built (the intercept always stays):

price.3 <- update(price.2, .~. -year_built)
summary(price.3)

Call:
lm(formula = sqrt(price) ~ area + bath + bed, data = duke_forest)

Residuals:
     Min       1Q   Median       3Q      Max 
-269.014  -48.644    3.322   53.145  214.244 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 396.41865   47.48641   8.348 6.32e-13 ***
area          0.08617    0.01418   6.079 2.64e-08 ***
bath         54.88481   14.73717   3.724 0.000336 ***
bed         -17.64005   16.50193  -1.069 0.287850    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 91.73 on 93 degrees of freedom
Multiple R-squared:  0.6043,    Adjusted R-squared:  0.5916 
F-statistic: 47.35 on 3 and 93 DF,  p-value: < 2.2e-16

bed is still non-significant, so it comes out next:

price.4 <- update(price.3, .~. -bed)
summary(price.4)

Call:
lm(formula = sqrt(price) ~ area + bath, data = duke_forest)

Residuals:
     Min       1Q   Median       3Q      Max 
-265.790  -52.915    0.374   56.264  205.081 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 360.91621   33.96588  10.626  < 2e-16 ***
area          0.08202    0.01364   6.011 3.48e-08 ***
bath         48.71751   13.57120   3.590 0.000528 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 91.8 on 94 degrees of freedom
Multiple R-squared:  0.5995,    Adjusted R-squared:  0.5909 
F-statistic: 70.34 on 2 and 94 DF,  p-value: < 2.2e-16

and now everything is significant, so here we stop.
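
For the record, the copy-paste alternative to update mentioned above would be to write each reduced model out in full. These are the same two fits as price.3 and price.4 (the names ending in a are mine):

price.3a <- lm(sqrt(price) ~ area + bath + bed, data = duke_forest)
price.4a <- lm(sqrt(price) ~ area + bath, data = duke_forest)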

The reason for doing this is that we are trying to understand what selling price really depends upon, and now we have our answer. Machine-learning people say that the model with all four \(x\)s is “over-fitted”, meaning that results from it wouldn’t generalize to other house prices. Now that we have only significant explanatory variables, we can be pretty confident that these will generalize.

(k) If you want to, make a full set of residual plots for your final model (residuals vs fitted values, normal quantile plot of residuals, residuals vs all the explanatory) and convince yourself that all is now at least reasonably good. (I allow for the possibility that you are now bored with this question and would like to move on to something else, but I had already done these, so…)

Solution

Once again, copy, paste, and edit from your previous work. Residuals vs. fitted values:

ggplot(price.4, aes(x = .fitted, y = .resid)) + geom_point()

Normal quantile plot of residuals:

ggplot(price.4, aes(sample = .resid)) + stat_qq() + stat_qq_line()

Residuals vs \(x\)s. This is very like what you did in (b), only now using the residuals rather than the response. Remember that the residuals come from the fitted model object, but the explanatory variable values come from the original data, so the idea is to take the model and augment the data:

price.4 %>% augment(duke_forest) %>% 
  pivot_longer(bed:year_built, names_to = "xname", values_to = "x") %>% 
  ggplot(aes(x = x, y = .resid)) + geom_point() + 
    facet_wrap(~xname, scales = "free")

I think these are all good enough. The residuals vs fitted values looks random; the normal quantile plot has slightly long tails but not really enough to worry about; the residuals vs \(x\)s look random, allowing for the discrete distributions of bath and bed.8

The reason for including the \(x\)s we dropped from the model in the last graph is that any non-linear dependence on these \(x\)s will show up here, and we could go back and add that \(x\) squared if needed. Here, though, there are no patterns at all in the two graphs in the bottom row, so we can be confident that we got it right.

Footnotes

  1. One acre is approximately 0.4 hectares or 4000 square metres.↩︎

  2. One acre is approximately 0.4 hectares or 4000 square metres.↩︎

  3. But see Extra 2.↩︎

  4. I tried a smooth trend, but this gets deceived by the points out on their own on the right.↩︎

  5. The point at the top right with both the largest population and the largest amount of open space is New York. If this city had more or less open space than it actually did, the line could move appreciably.↩︎

  6. Though from only 205 families, which a more careful analysis would take into account; these are not really 928 independent children’s heights.↩︎

  7. If you didn’t know the other things, the number of bedrooms would undoubtedly have an impact, but it does not, over and above those other things.↩︎

  8. You can see that some of the houses have 2.5 bathrooms, and you might be wondering how that can be. A half-bathroom, according to this, is a bathroom without an actual bath (or shower), typically a toilet and washbasin. You might have one on the main floor of the house, to save you or your guests having to go upstairs.↩︎