Worksheet 11

Published

November 17, 2023

Questions are below. My solutions appear below all the parts of each question; scroll down if you get stuck. There is extra discussion below that for some of the questions; you might find it interesting to read, maybe after tutorial.

For these worksheets, you will learn the most by spending a few minutes thinking about how you would answer each question before you look at my solution. There are no grades attached to these worksheets, so feel free to guess: it makes no difference at all how wrong your initial guess is!

1 The boiling point of water

The boiling point of water is commonly known to be 100 degrees C (212 degrees F). But the actual boiling point of water depends on atmospheric pressure; when the pressure is lower, the boiling point is also lower. For example, at higher altitudes, the atmospheric pressure is lower because the air is thinner, so that in Denver, Colorado, which is 1600 metres above sea level, water boils at around 95 degrees C. Source.

Some data were collected on the atmospheric pressure at seventeen locations (pressure in the data file, in inches of mercury) and the boiling temperature of water at those locations (boiling, in degrees F). This is (evidently) American data. The data are in http://ritsokiguess.site/datafiles/boiling-point.csv. Our aim is to predict boiling point from atmospheric pressure.

There are rather a lot of parts here. Questions like this tend to have rather a lot to explore. I wouldn’t necessarily put all these on an assignment, but it is certainly worth your while to work through these now, in case something like them comes up again later.

  1. Read in and display (some of) the data.

  2. Draw a suitable plot of these data.

  3. Comment briefly on your plot and any observations that appear not to belong to the pattern shown by the rest of the observations.

  4. Fit a suitable linear regression, and display the results. (Do this even if you think this is not appropriate.)

  5. Comment briefly on whether the slope makes sense, and on the overall fit of the model.

  6. Make two suitable plots that can be used to assess the appropriateness of this regression.

  7. When you looked at your scatterplot, you may have identified some observations that did not follow the pattern of the others. Describe briefly how these observations show up on the two plots you just drew.

  8. It turns out that the two observations with the lowest pressure are errors. Create a new dataframe with these observations removed, and repeat the regression. (You do not need to make the residual plots.)

  9. Compare the slope and the R-squared from this regression with the values you got in the first regression. Why is it sensible that the values differ in the way they do?

The boiling point of water: my solutions

The boiling point of water is commonly known to be 100 degrees C (212 degrees F). But the actual boiling point of water depends on atmospheric pressure; when the pressure is lower, the boiling point is also lower. For example, at higher altitudes, the atmospheric pressure is lower because the air is thinner, so that in Denver, Colorado, which is 1600 metres above sea level, water boils at around 95 degrees C. Source.

Some data were collected on the atmospheric pressure at seventeen locations (pressure in the data file, in inches of mercury) and the boiling temperature of water at those locations (boiling, in degrees F). This is (evidently) American data. The data are in http://ritsokiguess.site/datafiles/boiling-point.csv. Our aim is to predict boiling point from atmospheric pressure.

(a) Read in and display (some of) the data.

Solution

This is a .csv file, so no great difficulty:

my_url <- "http://ritsokiguess.site/datafiles/boiling-point.csv"
boiling_point <- read_csv(my_url)
Rows: 17 Columns: 2
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
dbl (2): boiling, pressure

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
boiling_point

There are correctly 17 rows of data with two columns boiling and pressure as promised.

\(\blacksquare\)

(b) Draw a suitable plot of these data.

Solution

There are two quantitative variables, so a scatterplot is the thing. We are trying to predict boiling point, so that goes on the \(y\)-axis:

ggplot(boiling_point, aes(x = pressure, y = boiling)) + geom_point() + geom_smooth()
`geom_smooth()` using method = 'loess' and formula = 'y ~ x'

Adding a smooth trend will help us to get a sense of what kind of relationship there is. This is better than adding a regression line to this plot, because we don’t know whether the relationship is straight.

Comment on this one below.

\(\blacksquare\)

(c) Comment briefly on your plot and any observations that appear not to belong to the pattern shown by the rest of the observations.

Solution

There is a clear upward trend, one that appears to be linear, apart from the two observations on the left that have a higher boiling point than you would expect for such a low pressure.

Do not, at this point, suggest removing these points. If they are genuine, they have to be included in our model. Removing them for no reason (other than “they don’t fit a line”) is scientific dishonesty, because you want a model that will generalize to future observations, including ones like those on the left. If you want to make a suggestion, suggest checking whether those points on the left are errors, and if they are errors, removing them. That is the only justification for removing data points: in that case, they are not like the others, for a reason you can explain.

\(\blacksquare\)

(d) Fit a suitable linear regression, and display the results. (Do this even if you think this is not appropriate.)

Solution

You probably think that those points over on the left mean that we should not fit a regression at all (which is the reason for my parenthetical comment in the question), but fit the regression here anyway. We will critique it shortly.

boiling_point.1 <- lm(boiling ~ pressure, data = boiling_point)
summary(boiling_point.1)

Call:
lm(formula = boiling ~ pressure, data = boiling_point)

Residuals:
     Min       1Q   Median       3Q      Max 
-1.41483 -0.91550 -0.05148  0.76941  2.72840 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 163.9307     2.6551   61.74  < 2e-16 ***
pressure      1.5796     0.1051   15.04 1.88e-10 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.291 on 15 degrees of freedom
Multiple R-squared:  0.9378,    Adjusted R-squared:  0.9336 
F-statistic:   226 on 1 and 15 DF,  p-value: 1.879e-10

\(\blacksquare\)

(e) Comment briefly on whether the slope makes sense, and on the overall fit of the model.

Solution

There are actually three things I would like to see:

  • the slope is positive, as predicted in the question (a lower pressure goes with a lower boiling point, and thus higher with higher also).
  • the slope is (very strongly) significant, meaning that the positive slope here is far from just chance.
  • the R-squared is 94%, meaning that the regression fits well even despite the two points on the left of the scatterplot.

\(\blacksquare\)

(f) Make two suitable plots that can be used to assess the appropriateness of this regression.

Solution

The two plots in question are of the residuals vs. fitted values, and a normal quantile plot of the residuals. Don’t forget to use your fitted model object (my boiling_point.1) as the “dataframe” to feed into ggplot:

ggplot(boiling_point.1, aes(x = .fitted, y = .resid)) + geom_point()

and

ggplot(boiling_point.1, aes(sample = .resid)) + stat_qq() + stat_qq_line()

\(\blacksquare\)

(g) When you looked at your scatterplot, you may have identified some observations that did not follow the pattern of the others. Describe briefly how these observations show up on the two plots you just drew.

Solution

On the scatterplot, we saw two points that were off a linear trend on the left. These had low pressure and a higher boiling point than expected given that low pressure. These will show up as very positive residuals (there is a code check sketched after this list):

  • on the residuals vs fitted values, they are up in the top left corner
  • on the normal quantile plot of residuals, they are the two most positive residuals, at the top right.
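
If you want to check which observations those are, here is a sketch (it assumes broom is loaded) that augments the original dataframe with the residuals and picks out the two most positive ones:

library(broom)
boiling_point.1 %>% augment(boiling_point) %>% 
  slice_max(.resid, n = 2)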

\(\blacksquare\)

(h) It turns out that the two observations with the lowest pressure are errors. Create a new dataframe with these observations removed, and repeat the regression. (You do not need to make the residual plots.)

Solution

Find a way to select the two observations with the lowest pressure. The easiest way is to note that their pressure values, whatever they are, must be less than about 21.25 (the major gridlines on my scatterplot are 2.5 apart, so the minor gridlines are 1.25 apart, and therefore the minor gridline 1.25 below 22.5 is at 21.25). That would give this:

boiling_point %>% 
  filter(pressure > 21.25) -> boiling_point_no_lo
boiling_point_no_lo

Correctly 15 rows left.

You could use anything between 21 and 22 as your dividing line.

Another way is to sort on pressure and then remove the two lowest ones:

boiling_point %>% arrange(pressure) %>% 
  slice(-(1:2))

slice with negative input says “everything but those rows”, or you could feed slice rows 3 through 17 of the original dataframe.

The data values are already sorted by pressure, in fact, so if you don’t do the sorting yourself you need to look and see that the two values you want to remove actually are the first two.
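
Here is a sketch of that rows-3-through-17 idea, with the sorting done explicitly so that it does not depend on the data already being in order:

boiling_point %>% arrange(pressure) %>% 
  slice(3:17)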

A third way is to use slice_max. It combines the slice with the sorting, so that you can do it all in one step:

boiling_point %>% 
  slice_max(pressure, n = 15)

The n = 15 is to say that you want to keep the \(17-2 = 15\) rows that have the largest pressure values, which is the same thing as throwing away the two rows with the smallest ones.

I had a hard time naming this dataset. I didn’t want to use numbers because I am using those for models.

All right, the regression:

boiling_point.2 <- lm(boiling ~ pressure, data = boiling_point_no_lo)
summary(boiling_point.2)

Call:
lm(formula = boiling ~ pressure, data = boiling_point_no_lo)

Residuals:
     Min       1Q   Median       3Q      Max 
-1.16358 -0.22154  0.01851  0.21714  0.76407 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 157.54528    1.24593  126.45  < 2e-16 ***
pressure      1.81477    0.04827   37.59 1.19e-14 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.5041 on 13 degrees of freedom
Multiple R-squared:  0.9909,    Adjusted R-squared:  0.9902 
F-statistic:  1413 on 1 and 13 DF,  p-value: 1.193e-14

\(\blacksquare\)

(i) Compare the slope and the R-squared from this regression with the values you got in the first regression. Why is it sensible that the values differ in the way they do?

Solution

I’m going to make a little Markdown table out of the values, for ease of comparison:1

| Regression   | Slope | R-squared |
|--------------|-------|-----------|
| Regression 1 | 1.580 | 0.938     |
| Regression 2 | 1.815 | 0.991     |
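
If you would rather pull these numbers out with code than read them off the two summaries, here is a sketch (it assumes broom is loaded; map_dfr comes from purrr, which loads with the tidyverse):

library(broom)
models <- list(`Regression 1` = boiling_point.1, `Regression 2` = boiling_point.2)
map_dfr(models, ~ tibble(slope = tidy(.x)$estimate[2],
                         r_squared = glance(.x)$r.squared),
        .id = "Regression")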

The second regression has both a more positive slope and a bigger R-squared than the first one.

The slope is bigger (more positive) after we remove the two observations because they had a higher than expected boiling point, and that pulled the line up on the left compared to the regression without those two observations. Thus the first line is less steep than the second, meaning that it has a less positive slope.

The R-squared is bigger because all the points, apart from the two we removed, are very2 close to a straight-line trend, and so the second regression should be, and is, very close to a perfect fit.

\(\blacksquare\)

2 Houses in Duke Forest, North Carolina

The data in http://ritsokiguess.site/datafiles/duke_forest.csv are of houses that were sold around November 2020 in the Duke Forest area of Durham, North Carolina. For each house, the selling price (in US $), called price, was recorded, along with some other features of the house:

  • bed: the number of bedrooms
  • bath: the number of bathrooms
  • area: the area of the inside of the house, in square feet
  • year_built: the year the house was originally built

Our aim is to predict the selling price of a house from its other features. There are 97 houses in the data set.

Note: this is rather a long question, but I wanted to give you a chance to practice everything. Some of the things in here you may not have seen by the end of Tuesday’s lecture; we’ll get to those on Thursday, so you can wait until then to practice those things, if you wish.

  1. Read in and display (some of) the data.

  2. Make a graph of selling price against each of the explanatory variables, using one ggplot line.

  3. Comment briefly on your plots.

  4. Fit a regression predicting price from the other variables, and display the results.

  5. What is the meaning of the number in the bath row in the Estimate column?

  6. Plot the residuals from your regression against the fitted values. What evidence is there that a transformation of the selling prices might be a good idea? (Hint: look at the right side of your graph.)

  7. Run Box-Cox. What transformation of price is suggested, if any?

  8. Rerun your regression with a suitably transformed response variable, and display the results.

  9. Confirm that the plot of residuals against fitted values now looks better.

  10. Build a better model by removing any explanatory variables that play no role, one at a time.

  11. If you want to, make a full set of residual plots for your final model (residuals vs fitted values, normal quantile plot of residuals, residuals vs all the explanatory) and convince yourself that all is now at least reasonably good. (I allow for the possibility that you are now bored with this question and would like to move on to something else, but I had already done these, so…)

Houses in Duke Forest, North Carolina: my solutions

The data in http://ritsokiguess.site/datafiles/duke_forest.csv are of houses that were sold around November 2020 in the Duke Forest area of Durham, North Carolina. For each house, the selling price (in US $), called price, was recorded, along with some other features of the house:

  • bed: the number of bedrooms
  • bath: the number of bathrooms
  • area: the area of the inside of the house, in square feet
  • year_built: the year the house was originally built

Our aim is to predict the selling price of a house from its other features. There are 97 houses in the data set.

Note: this is rather a long question, but I wanted to give you a chance to practice everything.

(a) Read in and display (some of) the data.

Solution

Suggestion, before you start: we’ll be using MASS later, so install and load that first, before even tidyverse. Then load broom as well if you plan to use that later.

The reading is a big surprise, I don’t think:

my_url <- "http://ritsokiguess.site/datafiles/duke_forest.csv"
duke_forest <- read_csv(my_url)
Rows: 97 Columns: 5
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
dbl (5): price, bed, bath, area, year_built

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
duke_forest

There are indeed 97 rows and the variables promised.

(b) Make a graph of selling price against each of the explanatory variables, using one ggplot line.

Solution

For a plot, we want the selling price in one column (it already is) and all the explanatory variable values in another column, with a third column labelling which explanatory variable it is. Then, to make the plot, plot selling price against the \(x\)-column, making facets according to the names of the \(x\)-variables.

This is done by pivoting longer first. I also always forget (until I see the graph) that a scales = "free" is needed:

duke_forest %>% 
  pivot_longer(-price, names_to = "xname", values_to = "xval") %>%
  ggplot(aes(x = xval, y = price)) + geom_point() + 
    facet_wrap(~xname, scales = "free")

If that code is too much for you, run the pivot-longer first and look at the results. Then think about how you would use that dataframe to make scatterplots of selling price against each of those explanatory variables.

The reason for the scales = "free" is that each of the explanatory variables is on its own scale, so you want the graph to have a different scale for each of them. For example, the year is a number like 2000, but the number of bedrooms is a number like 4. You can see, by taking out the scales = "free", that it doesn’t work very well to have all of them on the same scale!

(c) Comment briefly on your plots.

Solution

All of them except for year_built (bottom right) show an upward apparently linear trend, with at least one outlier. I don’t think there is any trend with the age of the house.

This makes sense, because you would expect a house with more bedrooms or bathrooms, and a larger square footage, to be more expensive to buy.

That’s about as much as you need to say. There are not any problems with non-linearity here.

Extra: One of the houses was sold for near $1.4 million, and is at the top of each of these plots. This house has three bedrooms and four bathrooms, which makes you wonder why it sold for so much, until you look at the area plot, where it is at the top right: a huge house that happens to have only three bedrooms and four bathrooms. The hugeness is what made it so expensive.
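
If you want to pick that house out of the data, a one-line sketch:

duke_forest %>% slice_max(price, n = 1)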

(d) Fit a regression predicting price from the other variables, and display the results.

Solution

price.1 <- lm(price ~ area + bath + bed + year_built, data = duke_forest)
summary(price.1)

Call:
lm(formula = price ~ area + bath + bed + year_built, data = duke_forest)

Residuals:
    Min      1Q  Median      3Q     Max 
-446286  -76678  -11789   77958  442108 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) -308181.85 1623259.49  -0.190  0.84984    
area            148.05      21.91   6.756 1.27e-09 ***
bath          70752.06   23685.14   2.987  0.00361 ** 
bed          -33713.81   25488.64  -1.323  0.18921    
year_built      189.11     832.21   0.227  0.82074    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 141600 on 92 degrees of freedom
Multiple R-squared:  0.6082,    Adjusted R-squared:  0.5912 
F-statistic: 35.71 on 4 and 92 DF,  p-value: < 2.2e-16

Or, if you prefer,

glance(price.1)
tidy(price.1)

or just the second one of those.

(e) What is the meaning of the number in the bath row in the Estimate column?

Solution

This value 70752 is the slope of bath in this multiple regression. It means that a house with an extra bathroom but the same number of bedrooms, same amount of square footage and built in the same year, is predicted to sell for just over $70,000 more. (The “all else equal” is important.)

This is, you might say, the “value” of an extra bathroom, in that if somebody wanted to remodel a house to add an extra bathroom, which would of course cost something, they could balance the construction cost with this slope to decide whether adding an extra bathroom is worth doing before they sell the house.
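
To see that slope as a predicted difference, here is a sketch that predicts for two otherwise-identical hypothetical houses differing only by one bathroom (the other values, 3 bedrooms, 2500 square feet, built in 1970, are made up for illustration):

new <- tibble(bed = 3, bath = c(2, 3), area = 2500, year_built = 1970)
cbind(new, pred = predict(price.1, new))

The two predicted prices differ by exactly the bath slope, just over $70,000.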

Extra: The coefficient for bed is negative, which appears to say that having another bedroom decreases the selling price, but look at the non-significant P-value: the slope coefficient for bed is actually not significantly different from zero. That is, once you know the other things, having an extra bedroom does not significantly increase the selling price.3

(f) Plot the residuals from your regression against the fitted values. What evidence is there that a transformation of the selling prices might be a good idea? (Hint: look at the right side of your graph.)

Solution

Use the fitted model object as if it were a dataframe:

ggplot(price.1, aes(x = .fitted, y = .resid)) + geom_point()

This looks pretty close to random, but if you look at the right side of the graph, you’ll see that the most extreme residuals (the two most positive ones and at least the four most negative ones) are all on the right side and not on the left. There is thus a little evidence of fanning out, which is the sort of thing a transformation of the response variable can help with. The flip side of that is that most of the residuals close to zero are on the left rather than the right, although admittedly there are more points on the left as well.

If you think about prices, this actually makes sense: higher prices are likely to be more variable, because a small change in a larger price is likely to be a larger number of dollars. Put another way, a change in something like the selling price of a house is likely to be expressed as a percentage, and a change of say 10% is a larger number of dollars if the price is larger to begin with.

If a percentage change is meaningful here, the right transformation is to take logs. We’ll see what gets suggested shortly.

(g) Run Box-Cox. What transformation of price is suggested, if any?

Solution

I suggest you copy-paste your lm from above, and change the lm to boxcox (you don’t need to change anything else). Most of what follows you can do by copying, pasting and editing:

boxcox(price ~ area + bath + bed + year_built, data = duke_forest)

The peak of the graph is close to 0.5, and the confidence interval goes from about 0.2 to about 0.8. We are looking for a whole number or a whole number plus a half for our transformation, and the only one that fits the bill is 0.5. Power 0.5 is square root, so that is the transformation we will use.

Note that the graph rules out power 1 (“do nothing”, since raising to a power 1 doesn’t change anything), and also “power 0” (take logs), because these are both outside the interval.
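
Extra: if the interesting part of the Box-Cox plot is hard to read, you can zoom in on a narrower range of powers via boxcox's lambda argument; a sketch:

boxcox(price ~ area + bath + bed + year_built, data = duke_forest,
       lambda = seq(0, 1, 0.05))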

(h) Rerun your regression with a suitably transformed response variable, and display the results.

Solution

This, or use (at least) tidy from broom again. Copy, paste, and insert the sqrt, changing the model number:

price.2 <- lm(sqrt(price) ~ area + bath + bed + year_built, data = duke_forest)
summary(price.2)

Call:
lm(formula = sqrt(price) ~ area + bath + bed + year_built, data = duke_forest)

Residuals:
     Min       1Q   Median       3Q      Max 
-268.536  -45.398    2.849   56.032  215.263 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) -178.16596 1055.18390  -0.169 0.866287    
area           0.08654    0.01425   6.075 2.77e-08 ***
bath          52.55888   15.39629   3.414 0.000955 ***
bed          -17.44161   16.56864  -1.053 0.295241    
year_built     0.29488    0.54097   0.545 0.587010    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 92.08 on 92 degrees of freedom
Multiple R-squared:  0.6056,    Adjusted R-squared:  0.5884 
F-statistic: 35.32 on 4 and 92 DF,  p-value: < 2.2e-16

(i) Confirm that the plot of residuals against fitted values now looks better.

Solution

Copy-paste the previous one and change the previous model to the current one:

ggplot(price.2, aes(x = .fitted, y = .resid)) + geom_point()

The most extreme residuals no longer seem to be in any predictable place (sometimes left, sometimes right), and knowing the fitted value doesn’t seem to say anything about what the residual might be, so I say this is better.

(j) Build a better model by removing any explanatory variables that play no role, one at a time.

Solution

There are two ways to go here: copy-paste your previous regression and literally remove what you want to remove, or use update, which is a bit more concise, and is what I did. The first \(x\)-variable to come out is the one with the highest P-value, which is year_built (the intercept always stays):

price.3 <- update(price.2, .~. -year_built)
summary(price.3)

Call:
lm(formula = sqrt(price) ~ area + bath + bed, data = duke_forest)

Residuals:
     Min       1Q   Median       3Q      Max 
-269.014  -48.644    3.322   53.145  214.244 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 396.41865   47.48641   8.348 6.32e-13 ***
area          0.08617    0.01418   6.079 2.64e-08 ***
bath         54.88481   14.73717   3.724 0.000336 ***
bed         -17.64005   16.50193  -1.069 0.287850    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 91.73 on 93 degrees of freedom
Multiple R-squared:  0.6043,    Adjusted R-squared:  0.5916 
F-statistic: 47.35 on 3 and 93 DF,  p-value: < 2.2e-16

bed is still non-significant, so it comes out next:

price.4 <- update(price.3, .~. -bed)
summary(price.4)

Call:
lm(formula = sqrt(price) ~ area + bath, data = duke_forest)

Residuals:
     Min       1Q   Median       3Q      Max 
-265.790  -52.915    0.374   56.264  205.081 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 360.91621   33.96588  10.626  < 2e-16 ***
area          0.08202    0.01364   6.011 3.48e-08 ***
bath         48.71751   13.57120   3.590 0.000528 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 91.8 on 94 degrees of freedom
Multiple R-squared:  0.5995,    Adjusted R-squared:  0.5909 
F-statistic: 70.34 on 2 and 94 DF,  p-value: < 2.2e-16

and now everything is significant, so here we stop.

The reason for doing this is that we are trying to understand what selling price really depends upon, and now we have our answer. Machine-learning people say that the model with all four \(x\)s is “over-fitted”, meaning that results from it wouldn’t generalize to other house prices. Now that we have only significant explanatory variables, we can be pretty confident that these will generalize.
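
Extra: an alternative to scanning the summary output yourself is drop1, which tests removing each explanatory variable from the current model; a sketch, starting from the model with everything in it:

drop1(price.2, test = "F")

The variable with the largest P-value here is year_built, the same one we took out first above.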

(k) If you want to, make a full set of residual plots for your final model (residuals vs fitted values, normal quantile plot of residuals, residuals vs all the explanatory) and convince yourself that all is now at least reasonably good. (I allow for the possibility that you are now bored with this question and would like to move on to something else, but I had already done these, so…)

Solution

Once again, copy, paste, and edit from your previous work. Residuals vs. fitted values:

ggplot(price.4, aes(x = .fitted, y = .resid)) + geom_point()

Normal quantile plot of residuals:

ggplot(price.4, aes(sample = .resid)) + stat_qq() + stat_qq_line()

Residuals vs \(x\)s. This is very like what you did in (b), only now using the residuals rather than the response. Remember that the residuals come from the fitted model object, but the explanatory variable values come from the original data, so the idea is to take the model and augment the data:

price.4 %>% augment(duke_forest) %>% 
  pivot_longer(bed:year_built, names_to = "xname", values_to = "x") %>% 
  ggplot(aes(x = x, y = .resid)) + geom_point() + 
    facet_wrap(~xname, scales = "free")

I think these are all good enough. The residuals vs fitted values looks random; the normal quantile plot has slightly long tails but not really enough to worry about; the residuals vs \(x\)s look random, allowing for the discrete distributions of bath and bed.4

The reason for including the \(x\)s we dropped from the model in the last graph is that any non-linear dependence on these \(x\)s will show up here, and we could go back and add that \(x\) squared if needed. Here, though, there are no patterns at all in the two graphs in the bottom row, so we can be confident that we got it right.

3 Teenage Gambling

A British study investigated the relationship between the amount of money spent on gambling by teenagers, and other variables. The entire variable list is as shown below:

  • sex Male or female as each teenager identified themselves (this is really “gender”, but it was called sex in my source).
  • status Socioeconomic status score based on parents’ occupations
  • income (GB pounds per week)
  • verbal words out of a list of 12 correctly defined
  • gamble expenditure on gambling (GB pounds per year).

The data are in http://ritsokiguess.site/datafiles/teen-gambling.csv. The data come from 1988.

  1. Read in and display (some of) the data. How many teens took part in the survey?

  2. We are interested in whether the amount spent on gambling depends on income, and whether the trend (if any) is different for males and females. Make a graph that will offer some insight into these issues.

  3. What do you learn from your plot, in the context of the data? Talk about trends you see, how they differ between males and females, and any exceptions to the trends.

  4. Fit a suitable analysis of covariance model, and display the output.

  5. Is your interaction term significant? What does that mean in the context of the data? Is that surprising given what you have learned about the data so far? Explain briefly.

  6. Predict, using your ANCOVA model, the gambling expenditures for males and females with incomes 5 and 10 (pounds per week). Do all four combinations. How does this support the significance (or not) of the interaction term in your model? Explain briefly.

Teenage Gambling: my solutions

A British study investigated the relationship between the amount of money spent on gambling by teenagers, and other variables. The entire variable list is as shown below:

  • sex Male or female as each teenager identified themselves (this is really “gender”, but it was called sex in my source).
  • status Socioeconomic status score based on parents’ occupations
  • income (GB pounds per week)
  • verbal words out of a list of 12 correctly defined
  • gamble expenditure on gambling (GB pounds per year).

The data are in http://ritsokiguess.site/datafiles/teen-gambling.csv. The data come from 1988.

(a) Read in and display (some of) the data. How many teens took part in the survey?

Solution

The usual for a csv:

my_url <- "http://ritsokiguess.site/datafiles/teen-gambling.csv"
teens <- read_csv(my_url)
Rows: 47 Columns: 5
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (1): sex
dbl (4): status, income, verbal, gamble

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
teens

We have the five variables described in the preamble.

There are 47 rows, so 47 teenagers took part in the survey. One of the two points for saying this.

Extra: the data came from the teengamb dataset in the faraway package. My only edit was to label the genders (the original dataset labelled them as 0 and 1).

(b) We are interested in whether the amount spent on gambling depends on income, and whether the trend (if any) is different for males and females. Make a graph that will offer some insight into these issues.

Solution

Two quantitative variables, income plus the amount spent on gambling, and one categorical variable sex. This suggests a scatterplot with the sexes distinguished by colour. We are also looking for trends, so a smooth trend for each would be a good place to start:

ggplot(teens, aes(x = income, y = gamble, colour = sex)) +
  geom_point() + geom_smooth()
`geom_smooth()` using method = 'loess' and formula = 'y ~ x'

Plotting the regression lines here is not as good, because we don’t yet know whether the trends are straight. If you do that, you will also need to comment in the next part about whether you think the straight lines that you assumed are appropriate.
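
If you do decide to look at the regression lines as well (bearing in mind the caution above), a sketch:

ggplot(teens, aes(x = income, y = gamble, colour = sex)) +
  geom_point() + geom_smooth(method = "lm", se = FALSE)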

(d) Fit a suitable model for predicting the amount spent on gambling from income and gender, and display the output.

Solution

This is a regression with a categorical variable. It is especially important to include an interaction term here because we suspect from the graph that the slopes for males and females are not going to be the same:

teens.1 <- lm(gamble ~ income*sex, data = teens)
summary(teens.1)

Call:
lm(formula = gamble ~ income * sex, data = teens)

Residuals:
    Min      1Q  Median      3Q     Max 
-56.522  -4.860  -1.790   6.273  93.478 

Coefficients:
               Estimate Std. Error t value Pr(>|t|)   
(Intercept)      3.1400     9.2492   0.339  0.73590   
income           0.1749     1.9034   0.092  0.92721   
sexmale         -5.7996    11.2003  -0.518  0.60724   
income:sexmale   6.3432     2.1446   2.958  0.00502 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 20.98 on 43 degrees of freedom
Multiple R-squared:  0.5857,    Adjusted R-squared:  0.5568 
F-statistic: 20.26 on 3 and 43 DF,  p-value: 2.451e-08

(e) Is your interaction term significant? What does that mean in the context of the data? Is that surprising given what you have learned about the data so far? Explain briefly.

Solution

The phrasing of the question suggests that if you didn’t have an interaction term in your model before, you should go back and put one in!

The interaction term is definitely significant, with a small P-value of 0.005. This means that the slopes of the lines for predicting gambling expenditure from income are different for males and females. One point for this. Specifically, since females are the baseline, the slope of the line for males is larger than the slope for females, because the Estimate for the interaction term is positive.

This is not at all surprising given what we saw on the graph: gambling expenditure goes up much faster for males than it does for females. The second point.
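
Extra: to make the two slopes explicit, here is a sketch that pulls them out of the fitted model (females are the baseline, so the male slope is the income slope plus the interaction estimate):

co <- coef(teens.1)
co["income"]                         # slope for females, about 0.17
co["income"] + co["income:sexmale"]  # slope for males, about 6.52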

(f) (3 points) Predict, using your model, the gambling expenditures for males and females with incomes 5 and 10 (pounds per week). Do all four combinations. How does this support the significance (or not) of the interaction term in your model? Explain briefly.

Solution

Predicting for combinations of things ought to make you think of crossing. It’s really the same idea as for the other predictions we’ve done (and for the same reason: to help us understand how the model is behaving). This is how I like to do it:

sexes <- c("female", "male")
incomes <- c(5, 10)
new <- crossing(sex = sexes, income = incomes)
new

and then, do the predictions, and put them beside what they are predictions for:

p <- predict(teens.1, new)
cbind(new, p)

p is a vector rather than a matrix, so bind_cols will also work here. Two points for getting this far.

For females, the predicted gambling expenditure is hardly changing at all as income goes up from 5 to 10. But for males, the increase in gambling expenditure is considerable, going up by about 33 pounds per year (or, if you like, more than doubling, while the females’ increase was less than 25%). This is what the significant interaction term in our model was telling us. One point for making the connection between the different rates of increase here and the meaning of the significant interaction.
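
To make that concrete in code, a sketch that computes each sex’s predicted increase from income 5 to 10, continuing from the new and p above:

cbind(new, p) %>% 
  group_by(sex) %>% 
  summarize(increase = diff(p))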

4 The boiling point of water: extras

(g):

On the residuals vs. fitted values, the two strange observations show up a long way from the others at the top left. The other residuals also appear to have an upward trend. This is characteristic of the off-trend points being influential: they pull the line closer to themselves than it should really go. Here is the scatterplot from before, but this time with the regression line on it:

ggplot(boiling_point, aes(x = pressure, y = boiling)) + geom_point() + geom_smooth(method = "lm")
`geom_smooth()` using formula = 'y ~ x'

When the off-trend points are unusually high or low (these ones are the lowest), they can have a disproportionate influence over where the line goes. Here, the line is higher on the left and lower on the right than it would be if those two low-pressure points were not there. (A better line then would be steeper than this one, lower on the left and higher on the right.) This has the effect of producing a lot of negative residuals on the left and a lot of positive ones on the right, which explains the trend in the other points on your plot.

The normal quantile plot is actually not bad. I expected those two very positive residuals to look like outliers at the top, and they really don’t; the normal quantile plot suggests a right-skewed distribution of residuals (an upward-opening curve) more than upper-tail outliers.

In this case, it is the residual vs. fitted value plot that really reveals the problem. The fact that there is a problem on one of these plots means that the regression as a whole is not appropriate.

(i): Extra 1:

If you are feeling brave enough, you can support your answers above by drawing a scatterplot with both lines on, to make it clearer as to why they differ in the way they do. There are several ways you might tackle this; here’s mine:

ggplot(boiling_point, aes(x = pressure, y = boiling)) + 
  geom_point(colour = "red") + geom_smooth(method = "lm", linetype = "dashed", se = FALSE) +
  geom_point(data = boiling_point_no_lo) + 
  geom_smooth(data = boiling_point_no_lo, method = "lm", se = FALSE)
`geom_smooth()` using formula = 'y ~ x'
`geom_smooth()` using formula = 'y ~ x'

There are some things to be aware of:

  • the ggplot has the original dataframe as its default, so I begin by plotting the points and regression line for that (the first regression).
  • I plot all the data points in red (you’ll see why most of them look black in a moment).
  • the first regression line is drawn dashed, so that we can use the default solid line for the second one. (You could also draw it a different colour.) I removed the default grey “envelope”, since the plot will be busy enough.
  • then I plot the points of the second data set (without the two low-pressure ones). This is a different data set than in the original ggplot, so I have to use data = inside. The points will be the default black, and the x and y for the plot are still pressure and boiling, so I don’t need to say anything else. What this does is to draw black points over the top of the red ones we drew in the first geom_point, for all the points in the second data set. This leaves the low-pressure points red.
  • finally, I add the regression line for the second data set. I have to tell ggplot again to use the second data set, or else it will use the one in the ggplot statement. (I was caught by this the first time I made this graph, and only got one line.) Note that ggplot will not let you extrapolate; the lines only go as far as the most extreme \(x\)-values in the data set they are based on.

This shows how the first line is pulled up at the low end by the two red points, which are still quite far from the (dashed) line. Hence the second line will have a bigger (more positive) slope, and the remaining points are closer to it, so it will have a higher R-squared as well.

Extra 2: I didn’t ask for the residual plots for the second regression. I hope there will be no problems with them now.6 But in case you are interested:

ggplot(boiling_point.2, aes(x = .fitted, y = .resid)) + geom_point()

That is kind of odd, with the upward trend, the point at the bottom, then the downward trend. Let’s try to investigate more:

boiling_point.2 %>% augment(boiling_point_no_lo)

The odd one at the bottom of the plot is the 10th observation, observed 204.6 and predicted 205.8. I think what has happened is that, having removed those two very off-trend points on the left, we now have an off-trend point in the middle. This is a legit observation, though, so we are stuck with it. As the two off-trend points on the left did, it pulls the line (slightly) towards itself (down) and away from the line that we would get if we took this point out.7 In the same way that we got a funny trend in the other points in the residual plot from the first regression, what this one says is that if you were to take this point out, you’d have a different trend for the other points again. To go back and look at my plot (the one with both regression lines on it), the impression seems to be that if you fitted a regression line to just the points to the left of obs. 10, it would have a steeper slope, and if you fitted one to just the points to the right of obs. 10, it would be less steep. That would explain the pattern on the plot.8 I think I have that right. Feel free to check by working it out yourself.
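
To pick out that odd observation directly, a sketch that finds the most negative residual:

boiling_point.2 %>% augment(boiling_point_no_lo) %>% 
  slice_min(.resid, n = 1)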

The normal quantile plot:

ggplot(boiling_point.2, aes(sample = .resid)) + stat_qq() + stat_qq_line()

That seems to say long tails, but the very negative residual, that observation we were just talking about, really stands out. If you look back at the scatterplot, you can now see that it stands out there as well, in the middle of the plot and clearly below the line.

Footnotes

  1. You only need enough decimals to make the comparison clear. Two is enough, but I went with three.↩︎

  2. I back off on this a bit later, but it’s the basic idea.↩︎

  3. If you didn’t know the other things, the number of bedrooms would undoubtedly have an impact, but it does not, over and above those other things.↩︎

  4. You can see that some of the houses have 2.5 bathrooms, and you might be wondering how that can be. A half-bathroom, according to this, is a bathroom without an actual bath (or shower), typically a toilet and washbasin. You might have one on the main floor of the house, to save you or your guests having to go upstairs.↩︎

  5. The major exception to the straightness is at the right end, caused by the two teenagers with income 15 spending less on gambling than you might expect. I wouldn’t get too hung up on something caused by only two observations.↩︎

  6. I may have been too optimistic.↩︎

  7. Which we are not going to do.↩︎

  8. One of the reasons that I didn’t have you draw this plot was that the question was long enough already; another is that I didn’t want to make you grapple with this.↩︎