Worksheet 11

Published

November 20, 2025

Questions are below. My solutions will be available after the tutorials are all finished. The whole point of these worksheets is for you to use your lecture notes to figure out what to do. In tutorial, the TAs are available to guide you if you get stuck. Once you have figured out how to do this worksheet, you will be prepared to tackle the assignment that depends on it.

If you are not able to finish in an hour, I encourage you to continue later with what you were unable to finish in tutorial. I wanted to give you some extra practice, so there are three multiple regression scenarios. There will be some function-writing practice on Worksheet 12, which is not attached to a tutorial.

Packages

library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.6
✔ forcats   1.0.1     ✔ stringr   1.6.0
✔ ggplot2   4.0.1     ✔ tibble    3.3.0
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.2.0     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(broom)
library(MASS, exclude = "select")
library(leaps)

Behavior Risk Factor Surveillance System

The Centers for Disease Control and Prevention (in the US) run a survey called the Behavior Risk Factor Surveillance System, in which people submit information about health conditions and risk behaviours. The data we have are percentages of adults within each state for whom the item is true. The variables we are interested in here are:

  • FruitVeg5: the percent of respondents who eat at least 5 servings of fruits and vegetables every day (response)
  • SmokeEveryday: the percent of respondents who smoke every day.
  • EdCollege: the percent of respondents who have completed a college diploma or university degree.

The data are in http://ritsokiguess.site/datafiles/brfss_no_utah.csv. There are many other variables, which you can ignore. The state of Utah is omitted because a lot of people there do not smoke (for religious reasons), which has nothing to do with whether they eat fruits and vegetables.

  1. Read in and display (a little of) the data.

Solution

A simple read_csv:

my_url <- "http://ritsokiguess.site/datafiles/brfss_no_utah.csv"
brfss <- read_csv(my_url)
Rows: 49 Columns: 31
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (1): State
dbl (30): Age18_24, Age25_34, Age35_44, Age45_54, Age55_64, Age65orMore, EdL...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
brfss

There are 49 rows, one for each state apart from Utah. There are a lot of columns, 31 altogether. You can check for yourself, by looking or as below, that the columns I named above do actually exist. There is no need to “get rid” of the other columns, unless you have a strong desire to do so.

Extra: the “automated” way to see whether you have all the columns you need is to try to select them, copying and pasting their names from the question. I’m including the state names as well, since they are identifiers:

brfss %>% select(State, FruitVeg5, SmokeEveryday, EdCollege)

If you think it’s better, once you’ve done this, you can save the cut-down dataframe with -> after the select, and use that the rest of the way. We will be using the state names later, so it is smart to read ahead through the questions to see what you’ll need later.
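
For example (brfss_small is just a name I made up):

brfss %>% select(State, FruitVeg5, SmokeEveryday, EdCollege) -> brfss_small
brfss_small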

\(\blacksquare\)

  2. Make a suitable plot of the percent of people eating five fruits and vegetables every day against the percent smoking every day. Add a smooth trend if appropriate.

Solution

All the variables in this question are quantitative, so all the plots will be scatterplots. The response variable is FruitVeg5, so that goes on the \(y\) (vertical) axis. One of the main purposes of a scatterplot is to discover whether there is a relationship between the variables, and if so, what kind of relationship, so a smooth trend makes sense:

ggplot(brfss, aes(y = FruitVeg5, x = SmokeEveryday)) + geom_point() +
  geom_smooth()
`geom_smooth()` using method = 'loess' and formula = 'y ~ x'

At this point, a smooth trend is better than a regression line, because we don’t know whether the relationship is straight or not.
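
If you do later decide that the trend is straight, you can overlay a line as well and see how different the two are (for comparison only; it is not needed for this question):

ggplot(brfss, aes(y = FruitVeg5, x = SmokeEveryday)) + geom_point() +
  geom_smooth() + geom_smooth(method = "lm", colour = "red")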

\(\blacksquare\)

  3. Describe any trends you see on your picture (think form, direction, strength).

Solution

I see:

  • form: more or less linear (no real suggestion of a curve,1 especially given the (considerable) scatter of the points)
  • direction: a downward trend
  • strength: weak to moderate (after all, there clearly is a downward trend; it just isn’t very strong).

Your opinions might differ (especially about strength), and that can be fine as long as you have given some thought to what you see and have come up with adjectives that describe that.

\(\blacksquare\)

  4. Draw a scatterplot of fruit and vegetable consumption vs rate of college graduation. Describe what you see on your scatterplot (form, direction, strength).

Solution

Copy the code, change the variable name:

ggplot(brfss, aes(y = FruitVeg5, x = EdCollege)) + geom_point() +
  geom_smooth()
`geom_smooth()` using method = 'loess' and formula = 'y ~ x'

Here, I see:

  • form: more or less linear. The curve at the bottom end seems to be caused by that one state that has an unusually high FruitVeg5 value (over 25) despite having the second-lowest EdCollege value. So I am not going to worry about that.2 You might also note a “kink” in the trend at an EdCollege value around 35. In that case, you might say that the best description of the form of the trend is two straight lines glued together, rather than a curve.
  • direction: an upward trend.
  • strength: I would call this a moderately strong trend: a bit stronger than the other one, since the points seem to be overall closer to the trend. (This is a judgement call.)

\(\blacksquare\)

  5. Run a regression predicting the percent of adults who eat 5 servings of fruits and vegetables daily from EdCollege and SmokeEveryday, and display the results.

Solution

This:

brfss.1 <- lm(FruitVeg5 ~ EdCollege + SmokeEveryday, data = brfss)
summary(brfss.1)

Call:
lm(formula = FruitVeg5 ~ EdCollege + SmokeEveryday, data = brfss)

Residuals:
    Min      1Q  Median      3Q     Max 
-6.5881 -1.4681  0.3668  1.8217  6.5814 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)    22.1165     5.9015   3.748 0.000497 ***
EdCollege       0.2351     0.1044   2.253 0.029066 *  
SmokeEveryday  -0.4359     0.2033  -2.144 0.037371 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.747 on 46 degrees of freedom
Multiple R-squared:  0.4266,    Adjusted R-squared:  0.4017 
F-statistic: 17.11 on 2 and 46 DF,  p-value: 2.783e-06

Or this, if you prefer:

glance(brfss.1)
tidy(brfss.1)

\(\blacksquare\)

  6. Look at the two numbers in the Estimate column, not including the intercept. Do their signs, positive or negative, make sense in the context of the data, based on what you know or can guess? Explain briefly.

Solution

The slope for EdCollege is positive. This means that a state with a higher proportion of college graduates will also tend to have a higher proportion of people who eat their fruit and veg, holding smoking constant. This makes sense because you would expect people who have completed college to be more educated in general, and you’d expect such people to know about the importance of a good diet.

The slope for SmokeEveryday is negative. This means that a state with more smokers will tend to have fewer people eating fruits and vegetables. This is a direct health link: people who eat fruits and vegetables tend to care about their health, and people who smoke tend not to, so in the data we see states whose inhabitants are, on the whole, either more healthy or less healthy.

\(\blacksquare\)

  7. What does the last number in the rightmost column of the regression output tell us? Does this make sense in the light of graphs you have drawn? Explain briefly.

Solution

On my output, that number is 0.037. This is the P-value for testing whether the slope for SmokeEveryday is zero or not zero, or, equivalently, whether SmokeEveryday has anything to add to a regression that already contains EdCollege. (Or, also equivalently, whether SmokeEveryday could be removed from this regression.) This P-value is less than 0.05 and thus significant, and so the slope is definitely not zero (or, the explanatory variable does have something to add, or should not be removed.)

I was hoping that you would enter the explanatory variables in the same order that I did. If you had them the other way around, you will be referring to the other variable and the other plot you drew.

Think back to the adjective you used earlier to describe the strength of the relationship on your scatterplot, “weak” or “moderate” or something like that. The relationship didn’t look especially convincing, but the P-value says that it is more convincing than chance. (If there were no relationship, it would look more random than this.) You are entitled to be surprised that the relationship is significant, given the appearance of the scatterplot.
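
Extra: if you would rather pull that one number out of the output than read it off the display, tidy from broom will do it:

tidy(brfss.1) %>% filter(term == "SmokeEveryday")

The p.value column of this one-row table is the P-value we have been talking about.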

\(\blacksquare\)

  8. Draw a complete set of residual plots for this regression (four plots). Do you have any concerns? Explain briefly.

Solution

Residuals against fitted values, normal quantile plot of residuals, residuals against each of the two explanatory variables.

I’m going to do the augment thing first, to avoid warnings about fortify:

brfss.1a <- augment(brfss.1)

Residuals against fitted values:

ggplot(brfss.1a, aes(x = .fitted, y = .resid)) + geom_point()

Pretty much random. That point top left with the biggest (most positive) residual is the only one to be at all worried about.

Normal quantile plot of residuals:

ggplot(brfss.1a, aes(sample = .resid)) + stat_qq() + stat_qq_line()

On the line, really, all the way along. I wouldn’t stress about those three most negative residuals; they are really not far below the line. Also: that most positive residual, which shows up on the other plots as “unusually large”, is indeed quite a bit bigger than the second largest one, but it is close to the line here, so it is pretty much the size you’d expect the largest residual to be (and thus ought not to be a big concern).

Residuals against explanatory variables. This is the point where you need to use augment if you didn’t before. The best way is the pivot-longer way to get the plots all at once:

brfss.1a %>% 
  pivot_longer(EdCollege:SmokeEveryday, names_to = "xname", values_to = "xval") %>% 
  ggplot(aes(x = xval, y = .resid)) + geom_point() + 
  facet_wrap(~ xname, scales = "free")

Be prepared for embarrassingly stupid plots if you forget the scales (as I did).

Alternatively, plot one at a time (not so bad here because there are only two explanatory variables):

ggplot(brfss.1a, aes(x = EdCollege, y = .resid)) + geom_point()

ggplot(brfss.1a, aes(x = SmokeEveryday, y = .resid)) + geom_point()

These are not bad. You could choose to be bothered by that point top left on the EdCollege plot and top right on the SmokeEveryday plot, which seems to have an unusually large (positive) residual.

I would declare myself to be happy overall. There are no systematic problems, such as graphs that are completely the wrong shape, and the normal quantile plot indicates that the “large” positive residual is not actually too large.

Extra: of course, I wanted to find out which state had that big residual. People in that state were more likely to eat the five fruits and vegetables than you would have expected from knowing how many of them smoked (a lot) or had a college education (not many):

brfss.1a %>% 
  mutate(state_name = brfss$State) %>% 
  slice_max(.resid, n = 1) %>% 
  select(state_name, everything())

Tennessee. 26.4% were observed to eat their five fruits and vegetables, but the prediction is only 19.8%. Maybe there are other variables on which this state is unusual. The one that jumps out at me is how few people aged 18–24 there are. I’d have to do some digging to find out what those other variables actually are.

Code-wise: I realized I needed to add the state names (getting them from the original data, since brfss.1a didn’t have them), then I grabbed the one largest (most positive) residual, then I displayed everything with the state name first. (everything() in this context means “everything else”.)
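
If you wanted to start the digging I mentioned above, sorting the states by that Age18_24 column (one of the ones we have otherwise been ignoring) would show where Tennessee sits:

brfss %>% 
  select(State, Age18_24) %>% 
  arrange(Age18_24)

If my reading of the output is right, Tennessee should be at or near the top of this list.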

If you want to investigate yourself, here is the survey website, and the 2019 survey questionnaire. The data are collected and recorded at an individual level (with nothing identifying the person who responded), and then aggregated and summarized to produce the data that you saw. The textbook I got the data from was published in 2012, so the data would come from an earlier questionnaire.

\(\blacksquare\)

Construction projects

How long does it take to complete a large construction project? This might depend on a number of things. In one study of this issue, fifteen construction projects were investigated (as a random sample of “all possible construction projects”). Five variables were measured:

  • time taken to complete the project (days)
  • size of the contract (thousands of dollars)
  • number of work days affected by bad weather
  • whether or not there was a worker’s strike during the construction (1 = yes, 0 = no)
  • number of subcontractors involved in the project.

Subcontractors are people like electricians, plumbers, and so on, that are hired by the company overseeing the whole project to complete specific jobs. A large project might have a number of subcontractors coming in at different times to do parts of the work.

The data are in http://ritsokiguess.site/datafiles/construction.csv.

  1. Read in and display (some of) the data.

Solution

The usual for a .csv:

my_url <- "http://ritsokiguess.site/datafiles/construction.csv"
construction <- read_csv(my_url)
Rows: 15 Columns: 5
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
dbl (5): completion_time, size, bad_weather, strike, subcontractors

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
construction

15 rows (projects) and 5 (quantitative) variables. (strike is really categorical, but we are treating it as quantitative here, which is all right because it has only two categories.)

\(\blacksquare\)

  2. Fit a suitable regression, predicting the response variable from everything else, and display the results.

Solution

Predict completion time from everything else:

construction.1 <- 
  lm(completion_time ~ size + bad_weather + strike + subcontractors, 
     data = construction)
summary(construction.1)

Call:
lm(formula = completion_time ~ size + bad_weather + strike + 
    subcontractors, data = construction)

Residuals:
     Min       1Q   Median       3Q      Max 
-17.6284  -5.2983   0.1322   5.7912  17.6356 

Coefficients:
                Estimate Std. Error t value Pr(>|t|)  
(Intercept)    -1.720891  11.647578  -0.148   0.8855  
size           -0.007873   0.006226  -1.265   0.2347  
bad_weather     0.670164   0.999093   0.671   0.5176  
strike         27.713734  11.363613   2.439   0.0349 *
subcontractors  3.534685   1.933829   1.828   0.0975 .
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 11.84 on 10 degrees of freedom
Multiple R-squared:  0.8469,    Adjusted R-squared:  0.7857 
F-statistic: 13.83 on 4 and 10 DF,  p-value: 0.0004398

\(\blacksquare\)

  3. Build a good model for predicting completion time, using backward elimination with \(\alpha = 0.10\). Describe your process.

Solution

This means to eliminate useless explanatory variables, starting with the least significant (but leaving the intercept in, because that always stays). You’ll see at the end why I said to use \(\alpha\) of 0.10 rather than the usual 0.05.

The first step is to remove the explanatory variable with the largest P-value, which is bad_weather. There are two ways to fit a model that is a small change from a previous one. The way that probably occurred to you first is copy, paste and edit:3

construction.2a <- 
  lm(completion_time ~ size + strike + subcontractors, 
     data = construction)
summary(construction.2a)

Call:
lm(formula = completion_time ~ size + strike + subcontractors, 
    data = construction)

Residuals:
    Min      1Q  Median      3Q     Max 
-16.620  -6.707  -1.041   4.628  21.424 

Coefficients:
                Estimate Std. Error t value Pr(>|t|)  
(Intercept)     0.481240  10.892336   0.044   0.9656  
size           -0.007178   0.005983  -1.200   0.2555  
strike         29.137658  10.880871   2.678   0.0215 *
subcontractors  3.964190   1.778536   2.229   0.0476 *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 11.54 on 11 degrees of freedom
Multiple R-squared:   0.84, Adjusted R-squared:  0.7964 
F-statistic: 19.26 on 3 and 11 DF,  p-value: 0.0001099

The other way, which I think is better,4 is to use update, which takes a model to update, and some instructions for how to update it:

construction.2 <- update(construction.1, . ~ . - bad_weather)
summary(construction.2)

Call:
lm(formula = completion_time ~ size + strike + subcontractors, 
    data = construction)

Residuals:
    Min      1Q  Median      3Q     Max 
-16.620  -6.707  -1.041   4.628  21.424 

Coefficients:
                Estimate Std. Error t value Pr(>|t|)  
(Intercept)     0.481240  10.892336   0.044   0.9656  
size           -0.007178   0.005983  -1.200   0.2555  
strike         29.137658  10.880871   2.678   0.0215 *
subcontractors  3.964190   1.778536   2.229   0.0476 *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 11.54 on 11 degrees of freedom
Multiple R-squared:   0.84, Adjusted R-squared:  0.7964 
F-statistic: 19.26 on 3 and 11 DF,  p-value: 0.0001099

The idea is that someone reading your code will immediately see that this is taking a (large) model and making a small change to it. On the other hand, unless you explicitly describe what is going on, the copy-paste-edit approach will make your reader wonder “how is construction.2a different from construction.1 again?” and they will have to go through and check how the two models are different. This is work that you need to be doing, not your reader.

Code-wise, the second input to update says “keep the response variable the same, and keep all the explanatory variables the same except take out bad_weather”.

Having gotten to this point, look again to see that size is still not significant, so that can come out as well (but it looks as if the other two will have to stay):

construction.3 <- update(construction.2, . ~ . - size)
summary(construction.3)

Call:
lm(formula = completion_time ~ strike + subcontractors, data = construction)

Residuals:
     Min       1Q   Median       3Q      Max 
-23.1842  -5.6759   0.8662   5.8523  19.8045 

Coefficients:
               Estimate Std. Error t value Pr(>|t|)  
(Intercept)       5.168     10.352   0.499   0.6267  
strike           29.163     11.078   2.633   0.0219 *
subcontractors    2.989      1.610   1.856   0.0882 .
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 11.75 on 12 degrees of freedom
Multiple R-squared:  0.8191,    Adjusted R-squared:  0.789 
F-statistic: 27.17 on 2 and 12 DF,  p-value: 3.502e-05

At \(\alpha = 0.10\), this is where we stop, because strike and subcontractors are both significant (at this level). (This is why I put the value in for \(\alpha\); otherwise you would need to take out subcontractors as well.)

I actually copied, pasted and edited the update line here.

Extra 1: This could also be done using step, but since that is an automatic process, you will need to describe what is going on:

construction.4 <- step(construction.1, direction = "backward", test = "F")
Start:  AIC=78.05
completion_time ~ size + bad_weather + strike + subcontractors

                 Df Sum of Sq    RSS    AIC F value  Pr(>F)  
- bad_weather     1     63.04 1464.2 76.715  0.4499 0.51756  
<none>                        1401.1 78.055                  
- size            1    224.07 1625.2 78.280  1.5993 0.23468  
- subcontractors  1    468.10 1869.2 80.378  3.3409 0.09752 .
- strike          1    833.35 2234.5 83.056  5.9478 0.03492 *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Step:  AIC=76.71
completion_time ~ size + strike + subcontractors

                 Df Sum of Sq    RSS    AIC F value  Pr(>F)  
- size            1    191.54 1655.7 76.559  1.4391 0.25549  
<none>                        1464.2 76.715                  
- subcontractors  1    661.27 2125.4 80.305  4.9680 0.04762 *
- strike          1    954.50 2418.7 82.244  7.1710 0.02149 *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Step:  AIC=76.56
completion_time ~ strike + subcontractors

                 Df Sum of Sq    RSS    AIC F value  Pr(>F)  
<none>                        1655.7 76.559                  
- subcontractors  1    475.21 2130.9 78.344  3.4442 0.08819 .
- strike          1    956.17 2611.9 81.397  6.9301 0.02187 *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

First bad_weather is removed, and then size. At this point, it is best to remove nothing else, leaving subcontractors and strike.

If you saved the result of step, you’ll get the final fitted model, which you can then look at:

summary(construction.4)

Call:
lm(formula = completion_time ~ strike + subcontractors, data = construction)

Residuals:
     Min       1Q   Median       3Q      Max 
-23.1842  -5.6759   0.8662   5.8523  19.8045 

Coefficients:
               Estimate Std. Error t value Pr(>|t|)  
(Intercept)       5.168     10.352   0.499   0.6267  
strike           29.163     11.078   2.633   0.0219 *
subcontractors    2.989      1.610   1.856   0.0882 .
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 11.75 on 12 degrees of freedom
Multiple R-squared:  0.8191,    Adjusted R-squared:  0.789 
F-statistic: 27.17 on 2 and 12 DF,  p-value: 3.502e-05

the same as before.

Extra 2: Another way to go is to look at the output from construction.1 again:

summary(construction.1)

Call:
lm(formula = completion_time ~ size + bad_weather + strike + 
    subcontractors, data = construction)

Residuals:
     Min       1Q   Median       3Q      Max 
-17.6284  -5.2983   0.1322   5.7912  17.6356 

Coefficients:
                Estimate Std. Error t value Pr(>|t|)  
(Intercept)    -1.720891  11.647578  -0.148   0.8855  
size           -0.007873   0.006226  -1.265   0.2347  
bad_weather     0.670164   0.999093   0.671   0.5176  
strike         27.713734  11.363613   2.439   0.0349 *
subcontractors  3.534685   1.933829   1.828   0.0975 .
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 11.84 on 10 degrees of freedom
Multiple R-squared:  0.8469,    Adjusted R-squared:  0.7857 
F-statistic: 13.83 on 4 and 10 DF,  p-value: 0.0004398

and say “well, it looks as if I can remove both size and bad_weather from this”. This is dangerous, though, because the P-values only apply to the removal of one explanatory variable from the model you are looking at. You can do it, but you must follow up with a test that checks whether removing both of those explanatory variables was OK.

The first step is to fit the model without the two explanatory variables you want to remove:

construction.5 <- update(construction.1, . ~ . - size - bad_weather)
# summary(construction.5)

The obligatory next step is to compare the fits of the two models using anova, smaller model first:

anova(construction.5, construction.1, test = "F")

This is not significant, so the two models fit equally well, and therefore we should go with the smaller, simpler model. That is to say, in this case it was all right to take out both explanatory variables, but we needed to do this test to be sure.

If the anova result is significant, that means that the smaller model fits significantly worse than the bigger one, and that taking out all the things you took out was a mistake. If this happens to you, you need to go back and remove fewer explanatory variables; if you had gone with one at a time, you would probably have found that one of them became significant during the process, and you would have ended up not removing that one.

By way of sanity-checking, the R-squared for the model with everything in (my construction.1) was 84.7%, and for the model with only strike and subcontractors was 81.9%. Taking out those two explanatory variables made only a small difference to R-squared, so it was probably a good idea (but you need to do the test to be sure).
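
If you would like those two R-squared values side by side, rather than scrolling back through the summary output, glance from broom is one way to line them up:

bind_rows(full = glance(construction.1), small = glance(construction.5), .id = "model") %>% 
  select(model, r.squared, adj.r.squared)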

Extra 3: This is where we got to before:

summary(construction.3)

Call:
lm(formula = completion_time ~ strike + subcontractors, data = construction)

Residuals:
     Min       1Q   Median       3Q      Max 
-23.1842  -5.6759   0.8662   5.8523  19.8045 

Coefficients:
               Estimate Std. Error t value Pr(>|t|)  
(Intercept)       5.168     10.352   0.499   0.6267  
strike           29.163     11.078   2.633   0.0219 *
subcontractors    2.989      1.610   1.856   0.0882 .
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 11.75 on 12 degrees of freedom
Multiple R-squared:  0.8191,    Adjusted R-squared:  0.789 
F-statistic: 27.17 on 2 and 12 DF,  p-value: 3.502e-05

What happens if we take out subcontractors now, as we would if working at \(\alpha = 0.05\)?

construction.6 <- update(construction.3, . ~ . - subcontractors)
summary(construction.6)

Call:
lm(formula = completion_time ~ strike, data = construction)

Residuals:
   Min     1Q Median     3Q    Max 
-19.00  -8.55   1.00   6.45  21.90 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   23.100      4.049   5.706 7.22e-05 ***
strike        45.900      7.012   6.545 1.87e-05 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 12.8 on 13 degrees of freedom
Multiple R-squared:  0.7672,    Adjusted R-squared:  0.7493 
F-statistic: 42.84 on 1 and 13 DF,  p-value: 1.867e-05

R-squared has gone down by about 5 percentage points, which might be starting to concern you as being too much. Also, adjusted R-squared has gone down. So maybe removing subcontractors, even though it wasn’t quite significant by the usual standards, was a bit too much. (At least strike is now strongly significant, and there is no doubt that this has to stay.)

\(\blacksquare\)

  4. For your best model, obtain a full set of residual plots. This means:
  • residuals vs fitted values
  • normal quantile plot of residuals
  • residuals vs each of the explanatory variables.

Solution

I start by running augment and using the output from that for all the plots, to save myself getting a warning:

construction.3a <- augment(construction.3, data = construction)
construction.3a

This time I am using the data = thing to make sure I have all the variables from the original dataset, including the ones I removed (because I want to plot the residuals against them as well). I displayed the augmented dataframe so you can see what it has in it.

Then:

ggplot(construction.3a, aes(x = .fitted, y = .resid)) + geom_point()

ggplot(construction.3a, aes(sample = .resid)) + stat_qq() + stat_qq_line()

construction.3a %>%  
  pivot_longer(size:subcontractors, 
               names_to = "xnames", values_to = "xvals") %>%
  ggplot(aes(x = xvals, y = .resid)) + geom_point() +
  facet_wrap( ~ xnames, scales = "free")

If you prefer, plot the residuals against the \(x\)s one by one, but the facets idea is best. (You could also do it one by one if you only had two explanatory variables, but if you have more than that, the pivot-longer approach is definitely better.)
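
For the record, the one-at-a-time version repeats the same plot once per explanatory variable, for example:

ggplot(construction.3a, aes(x = size, y = .resid)) + geom_point()

ggplot(construction.3a, aes(x = bad_weather, y = .resid)) + geom_point()

and so on for strike and subcontractors.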

\(\blacksquare\)

  5. Comment briefly on your plots, and how they support the regression being basically satisfactory. You may assume that all the data values were correctly recorded.

Solution

Remember that having two clouds of points is not in itself a bad thing; what matters is whether changing the value of the variable on the \(x\)-axis tells you anything about what the residual is going to be. Another way of saying that the residuals are random is that they are unpredictable given anything else.

  • There are about five fitted values that are clearly larger than the others, but they have residuals that average out to zero and have about the same spread as the residuals for the other fitted values. Said differently, there are two random clouds of residuals, with no pattern, either within each cloud or between the two clouds. I don’t really think this is enough for “fanning out”.

  • The points on the normal quantile plot are close to the line. Only the most negative residual is at all off the line (it is more negative than expected). (This is not really an “outlier” residual, since it doesn’t make sense to talk about residuals as having outliers, though it might be a residual that indicates a possible outlying data value. More on this later.)

  • in the plot against bad_weather, my only concern is the largest residual that happens to be top right on this plot: the job on which the actual construction time was much longer than predicted was when there was the most bad weather. The other points seem more or less random. You might imagine that on construction jobs, a little bad weather is allowed for in the planning, but when there is a lot, it can throw things badly off.

  • size has a strange pattern: small jobs have a desirable pattern of residuals, but for the six or so biggest jobs, the residuals increase with size and then sharply decrease. The very biggest job actually took a lot less time than expected.

  • strike only takes two values 0 and 1, but the residuals attached to each of these values seem to be random for each, averaging out to zero with similar spreads.

  • the pattern of residuals against subcontractors has an odd shape: there may be a curved trend in all the points except the one bottom right, and a possible outlier (the point bottom right). Without the outlier, I would try a squared term in subcontractors, but with it, it is difficult to know what to do, and difficult to say that something is more wrong than chance. The last sentence of the question rules out the possibility of removing the observation with the very negative residual: it is correct, so we have to include it in the model.

That observation with the very negative residual is the apparent outlier on the last plot, the one that is most off the line in the normal quantile plot, and is visible on the other plots as well. I told you that this outlier (and all the other observations) were correctly recorded, so there is no reason to remove this observation.

These last plots indicate that the regression is not really satisfactory, and that we should consider transforming some of the explanatory variables, or maybe seeing why those strange observations, such as the one with a very negative residual, are the way they are.

Extra: which observation is that one with the very negative residual?

construction.3a %>% 
  slice_min(.resid)

How does that compare with the rest of the data?

summary(construction)
 completion_time      size         bad_weather         strike      
 Min.   : 7.0    Min.   :  50.0   Min.   : 3.000   Min.   :0.0000  
 1st Qu.:18.0    1st Qu.:  72.5   1st Qu.: 6.000   1st Qu.:0.0000  
 Median :30.0    Median : 100.0   Median : 8.000   Median :0.0000  
 Mean   :38.4    Mean   : 415.0   Mean   : 9.467   Mean   :0.3333  
 3rd Qu.:55.0    3rd Qu.: 400.0   3rd Qu.:13.000   3rd Qu.:1.0000  
 Max.   :90.0    Max.   :2600.0   Max.   :20.000   Max.   :1.0000  
 subcontractors  
 Min.   : 3.000  
 1st Qu.: 5.500  
 Median : 8.000  
 Mean   : 7.867  
 3rd Qu.:10.500  
 Max.   :13.000  

This is a big project: the size and number of subcontractors are the largest of all. There was also a strike during this project, which is important because that was in our best model. The negative residual means that despite all this, the project was completed in less time than predicted (50 days vs. predicted 73 days). There might have been something else about this project that had an impact on the completion time (that we don’t know about), for example the people running this project could have paid the subcontractors extra to complete their work sooner (but still, we hope, safely and in compliance with building codes).

\(\blacksquare\)

Forced expiratory volume

One way to measure how well someone is breathing is to measure their “forced expiratory volume” (FEV), which is how much air (in litres) you can expel from your lungs in one second. If the FEV is too low, this may indicate difficulties in breathing. A doctor wanted to see whether children whose parents smoked at home had difficulties breathing (as measured by the FEV). The data are in http://ritsokiguess.site/datafiles/fev.csv. The doctor also recorded the child’s age (in years) and height (in inches), as well as the child’s gender (here male or female), along with whether or not the child had been exposed to smoking in the home (in the column Smoke). 654 children were observed.

  1. Read in and display (some of) the data.

Nothing new here:

my_url <- "http://ritsokiguess.site/datafiles/fev.csv"
fev <- read_csv(my_url)
Rows: 654 Columns: 5
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (2): Gender, Smoke
dbl (3): Age, FEV, Ht

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
fev

I should probably have chosen a different name for my dataframe. However, R usually distinguishes between uppercase and lowercase, so it is actually all right to have a dataframe called fev and a column in it called FEV. If rather confusing.

There are 654 rows (children) and the columns as promised.

  2. Fit a regression predicting forced expiratory volume from all the other variables. Why should you run drop1? Do so for this model.

fev.1 <- lm(FEV ~ Age + Ht + Gender + Smoke, data = fev)
drop1(fev.1, test = "F")

We should run drop1 rather than starting with summary because some of our explanatory variables are categorical, and this will tell us whether any of them can be removed as a whole, whereas summary only tells us about the effects of particular levels of the categorical variable.

  3. What does your drop1 output tell you? In particular, does it address the doctor’s concern?

My drop1 is a tidy-style table, so all the P-values (in the last column) are in scientific notation. Yours might look more like the summary output does.

The P-values for age, height, and gender are all much less than 0.05, and so these cannot be removed from the model. The P-value for Smoke, however, is 0.14, not significant. This says that even after you account for age, height, and gender, whether the child was exposed to smoke at home does not help you predict FEV, and so the doctor’s suspicion about the effects of smoking is not supported by these data.

  4. Look at the summary output for your model, and interpret the Estimates for Ht and Gender.

Get the summary first:

summary(fev.1)

Call:
lm(formula = FEV ~ Age + Ht + Gender + Smoke, data = fev)

Residuals:
     Min       1Q   Median       3Q      Max 
-1.37656 -0.25033  0.00894  0.25588  1.92047 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) -4.456974   0.222839 -20.001  < 2e-16 ***
Age          0.065509   0.009489   6.904 1.21e-11 ***
Ht           0.104199   0.004758  21.901  < 2e-16 ***
Gendermale   0.157103   0.033207   4.731 2.74e-06 ***
Smokeyes    -0.087246   0.059254  -1.472    0.141    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.4122 on 649 degrees of freedom
Multiple R-squared:  0.7754,    Adjusted R-squared:  0.774 
F-statistic:   560 on 4 and 649 DF,  p-value: < 2.2e-16

  • Ht is a quantitative variable, so the interpretation is as usual: an increase of 1 inch in height, with everything else remaining the same, is predicted to result in an increase of 0.104 litres in FEV. (This makes sense because a taller child will probably have bigger lungs.)

  • Gender is a categorical variable. This means that the interpretation is of how the category shown (male) compares with the baseline category (female, the first one alphabetically). Specifically, a boy is predicted to have a FEV of 0.157 greater than a girl of the same age, height and parental smoking status.

Extra: if you look carefully, you’ll see that the P-values for Gender and Smoke are the same in the summary and drop1 tables. This is because we are in the special situation of both our categorical variables having only two categories, and so a test of the second category vs. the baseline is the same as a test of the significance of the whole categorical variable. That said, I think it is still easier to remember that if you have any categorical explanatory variables, you assess significance by looking at drop1 rather than summary.

The reason for looking at summary is what we just did: to understand the significant effects (in this case, only the summary table tells us that there is a positive association between age and FEV, and also between height and FEV, and that boys have greater FEV than girls, all else equal).
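
As a small check on the Gender interpretation, you could predict FEV for two hypothetical children who are identical apart from gender. This is only a sketch: I am assuming the categories are coded female/male and no/yes, which is what the Gendermale and Smokeyes labels in the summary suggest:

new <- tibble(Age = 10, Ht = 60, Gender = c("female", "male"), Smoke = "no")
predict(fev.1, new)

The two predictions should differ by exactly the Gendermale Estimate, 0.157 litres.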

  5. Make a complete set of residual plots for this model. Do you have any concerns? (Hint: the procedure you know doesn’t like a mixture of quantitative and categorical variables, because it’s trying to put text and numbers into the same column. Do the quantitative variables first, and then do the categorical ones.)

I’m doing the augment thing first again. All the explanatory variables are in this model (we didn’t remove Smoke), so the simple augment is fine:

fev.1a <- augment(fev.1)
fev.1a

I’m displaying the result so that I do the pivots-longer5 correctly in a moment.

ggplot(fev.1a, aes(x = .fitted, y = .resid)) + geom_point()

That looks like a big fan-out problem: when the fitted value is larger, the predictions are less accurate. Real fan-out problems are usually about as obvious as this one is.

ggplot(fev.1a, aes(sample = .resid)) + stat_qq() + stat_qq_line()

Long tails, slightly.

Residuals vs \(x\)s, quantitative first:6

fev.1a %>% 
  pivot_longer(Age:Ht, names_to = "xname", values_to = "xval") %>%
  ggplot(aes(x = xval, y = .resid)) + geom_point() +
  facet_wrap(~ xname, scales = "free")

These both have an element of fanning out as well. It might be that fixing up the response will sort these out too.

Then categorical:

fev.1a %>% 
  pivot_longer(Gender:Smoke, names_to = "xname", values_to = "xval") %>%
  ggplot(aes(x = xval, y = .resid)) + geom_point() +
  facet_wrap(~ xname, scales = "free")

With these, you are looking for vertical strips of points that average out to zero, with about the same spread for all categories. With so many observations, it’s not obvious whether we have that or not, so an alternative is to use boxplots for the categorical variables, like this:7

fev.1a %>% 
  pivot_longer(Gender:Smoke, names_to = "xname", values_to = "xval") %>%
  ggplot(aes(x = xval, y = .resid)) + geom_boxplot() +
  facet_wrap(~ xname, scales = "free")

Having drawn those, I’m still not sure. Maybe the male residuals are more variable than the female ones. In the Smoke facet, Yes has a taller box, but No has outliers at both ends.

  6. Investigate a transformation of FEV. Are the results consistent with your residual plots?

The fanning-out on the plot of residuals vs fitted values (and elsewhere) suggests that the larger values need to be brought down, which implies that a transformation like log or square root might be helpful. You could propose one of those, but better is to fire up Box-Cox to see what the possibilities are:

boxcox(FEV ~ Age + Ht + Gender + Smoke, data = fev)

This strongly suggests a transformation of log (that is, that \(\lambda = 0\)). Doing nothing is ruled out (1 is well outside the confidence interval) and also square root is not supported by the data (0.5 is also outside the confidence interval). We have a lot of observations, so this plot is telling us pretty clearly what to do.

The fanning out on the residual plots suggests that higher values are being predicted less accurately, so we need to bring the higher values down and make them closer together, which is what a log transformation will do.
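
Extra: if the confidence interval is hard to read on the default plot, boxcox will accept a narrower range of \(\lambda\) values to zoom in on, something like:

boxcox(FEV ~ Age + Ht + Gender + Smoke, data = fev, lambda = seq(-0.5, 0.5, 0.05))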

  7. Fit a regression suggested by the results of your previous question, and display the drop1 and summary output. Has anything changed from your previous work? Explain briefly.

Replace FEV with the log of FEV, and copy-paste-edit previous work:

fev.2 <- lm(log(FEV) ~ Age + Ht + Gender + Smoke, data = fev)
drop1(fev.2, test = "F")
summary(fev.2)

Call:
lm(formula = log(FEV) ~ Age + Ht + Gender + Smoke, data = fev)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.63278 -0.08657  0.01146  0.09540  0.40701 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) -1.943998   0.078639 -24.721  < 2e-16 ***
Age          0.023387   0.003348   6.984  7.1e-12 ***
Ht           0.042796   0.001679  25.489  < 2e-16 ***
Gendermale   0.029319   0.011719   2.502   0.0126 *  
Smokeyes    -0.046068   0.020910  -2.203   0.0279 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.1455 on 649 degrees of freedom
Multiple R-squared:  0.8106,    Adjusted R-squared:  0.8095 
F-statistic: 694.6 on 4 and 649 DF,  p-value: < 2.2e-16

The smoking variable is now significant, with a P-value of 0.028. Looking at the summary, it has a negative Estimate, which means that children who were exposed to smoking at home have a lower log-FEV compared to children who were not (baseline), all else equal. So there is an effect of smoking after all, in the direction the doctor suspected, and the doctor is (finally) vindicated.
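
One more way to read that Estimate: because the response is log(FEV), exponentiating the Smoke coefficient turns it into a multiplicative effect on FEV itself:

exp(-0.046068)

This is about 0.955, so a child exposed to smoking at home is predicted to have an FEV about 4.5% lower than an otherwise-identical child who was not exposed.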

Extra: everything is now significant, so a graph is liable to be an oversimplification, but I wanted to show you a graph that uses the marginaleffects package, which we will see again in D29:

library(marginaleffects)
plot_predictions(fev.2, condition = c("Ht", "Smoke"))

These predictions are actually of log-FEV, which the vertical axis should show. I have shown how the predictions depend on height and smoking. The plot uses an average value for age, and the more common of the two genders:

fev %>% summarize(mean_age = mean(Age))
fev %>% count(Gender)

so boys, but the picture is similar for girls. The oversimplification is that taller children are almost certainly older, not shown here.
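
That last claim is easy enough to check from the data, if you want to:

fev %>% summarize(r = cor(Age, Ht))

I would expect this correlation to be fairly high.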

The plot shows that log-FEV goes up sharply with the child’s height (the taller they are, the bigger their lungs), but there is also a small but clear effect of being exposed to smoking: children not exposed are predicted to have a slightly higher log-FEV than those that were exposed, and that small effect is significant.

This graph goes to show that the effect of height (and presumably also of age) is very large, and the effect of being exposed to smoking could easily be hidden by it. The value of looking at a multiple regression is once again that we see the effect of each variable over and above the effect of the others; being exposed to smoking has a definite effect over and above the effects of the other things.

Extra extra: it would also be a good idea to check that the residual plots are now satisfactory, but it is late, I am tired, and I have to teach you guys tomorrow.

(later, as in Thursday afternoon) now I realize that there would be a lot of copying and pasting, and a small amount of editing, so let’s go for it:

fev.2a <- augment(fev.2)
fev.2a
ggplot(fev.2a, aes(x = .fitted, y = .resid)) + geom_point()

As random as you could want.

ggplot(fev.2a, aes(sample = .resid)) + stat_qq() + stat_qq_line()

Left-skewed, or a long lower tail. There are a few too many residuals that are too negative, now.

Residuals vs \(x\)s, quantitative first:

fev.2a %>% 
  pivot_longer(Age:Ht, names_to = "xname", values_to = "xval") %>%
  ggplot(aes(x = xval, y = .resid)) + geom_point() +
  facet_wrap(~ xname, scales = "free")

These are now appropriately random. Ages are only whole years, hence the vertical strips.

Then categorical, going straight to boxplots:

fev.2a %>% 
  pivot_longer(Gender:Smoke, names_to = "xname", values_to = "xval") %>%
  ggplot(aes(x = xval, y = .resid)) + geom_boxplot() +
  facet_wrap(~ xname, scales = "free")

These all average out close to zero and have more or less the same spread (as measured by the boxes). On these, the boxes within the same facet are the ones that need to be of about equal height. In this case, though, all four boxes look to have equal height. I am not too worried by the outliers; when you have a lot of observations, as here, boxplots will tend to show a lot of outliers that are not really outliers.8 There are more residuals that are negative boxplot-outliers than positive ones, which echoes what we saw in the normal quantile plot of residuals.

Overall, I would say that things are much improved. Getting rid of the fanning-out was the major priority, and we have done that. Having residuals that are a bit left-skewed as a result is, to me, a small price to pay. I think I am happy enough with this, and therefore the revised decision that being exposed to smoking does have a negative effect on (log of) FEV, over and above the effects of the other explanatory variables, is the one we should stand by.

Footnotes

  1. You might think this bends, but with this much scatter you will have a very hard time demonstrating that a curve is better than a straight line.↩︎

  2. Basing a decision about anything on one observation is never a good idea.↩︎

  3. Typing this all out again would be way too much work.↩︎

  4. But I can’t remember whether I showed you this in lecture.↩︎

  5. The noun is “pivot”, while “longer” is a comparative adjective, and the thing that should be pluralized is the noun. Hence “pivots-longer”.↩︎

  6. I can tell I’m getting tired, because it took me about three goes to get this right.↩︎

  7. It’s actually only a tiny change.↩︎

  8. Mary Eleanor Spear and John Tukey were imagining boxplots being used for much smaller sample sizes than we have here, say sample sizes of 30.↩︎