Worksheet 11

Published

November 20, 2025

Questions are below. My solutions will be available after the tutorials are all finished. The whole point of these worksheets is for you to use your lecture notes to figure out what to do. In tutorial, the TAs are available to guide you if you get stuck. Once you have figured out how to do this worksheet, you will be prepared to tackle the assignment that depends on it.

If you are not able to finish in an hour, I encourage you to continue later with what you were unable to finish in tutorial. I wanted to give you some extra practice, so there are three multiple regression scenarios. There will be some function-writing practice on Worksheet 12, which is not attached to a tutorial.

Packages

library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.6
✔ forcats   1.0.1     ✔ stringr   1.6.0
✔ ggplot2   4.0.1     ✔ tibble    3.3.0
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.2.0     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(broom)
library(MASS, exclude = "select")
library(leaps)

Behavior Risk Factor Surveillance System

The Centers for Disease Control and Prevention (in the US) run a survey called the Behavior Risk Factor Surveillance System, in which people submit information about health conditions and risk behaviours. The data we have are percentages of adults within each state for whom the item is true. The variables we are interested in here are:

  • FruitVeg5: the percent of respondents who eat at least 5 servings of fruits and vegetables every day (response)
  • SmokeEveryday: the percent of respondents who smoke every day.
  • EdCollege: the percent of respondents who have completed a college diploma or university degree.

The data are in http://ritsokiguess.site/datafiles/brfss_no_utah.csv. There are many other variables, which you can ignore. The state of Utah is omitted because a lot of people there do not smoke (for religious reasons), which has nothing to do with whether they eat fruits and vegetables.

  1. Read in and display (a little of) the data.

Solution

A simple read_csv:

my_url <- "http://ritsokiguess.site/datafiles/brfss_no_utah.csv"
brfss <- read_csv(my_url)
Rows: 49 Columns: 31
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (1): State
dbl (30): Age18_24, Age25_34, Age35_44, Age45_54, Age55_64, Age65orMore, EdL...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
brfss

There are 49 rows, one for each state apart from Utah. There are a lot of columns, 31 altogether. You can check for yourself, by looking or as below, that the columns I named above do actually exist. There is no need to “get rid” of the other columns, unless you have a strong desire to do so.

Extra: the “automated” way to see whether you have all the columns you need is to try to select them, copying and pasting their names from the question. I’m including the state names as well, since they are identifiers:

brfss %>% select(State, FruitVeg5, SmokeEveryday, EdCollege)

If you think it’s better, once you’ve done this, you can save the cut-down dataframe with -> after the select, and use that the rest of the way. We will be using the state names later, so it is smart to read ahead through the questions to see what you’ll need later.
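
For example (brfss_small is just a name I made up):

brfss %>% select(State, FruitVeg5, SmokeEveryday, EdCollege) -> brfss_small
brfss_small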

\(\blacksquare\)

  2. Make a suitable plot of the percent of people eating five fruits and vegetables every day against the percent smoking every day. Add a smooth trend if appropriate.

Solution

All the variables in this question are quantitative, so all the plots will be scatterplots. The response variable is FruitVeg5, so that goes on the \(y\) (vertical) axis. One of the main purposes of a scatterplot is to discover whether there is a relationship between the variables, and if so, what kind of relationship, so a smooth trend makes sense:

ggplot(brfss, aes(y = FruitVeg5, x = SmokeEveryday)) + geom_point() +
  geom_smooth()
`geom_smooth()` using method = 'loess' and formula = 'y ~ x'

At this point, a smooth trend is better than a regression line, because we don’t know whether the relationship is straight or not.
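
If you do later decide that the trend is straight, you can overlay a line as well and see how different the two are (for comparison only; it is not needed for this question):

ggplot(brfss, aes(y = FruitVeg5, x = SmokeEveryday)) + geom_point() +
  geom_smooth() + geom_smooth(method = "lm", colour = "red")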

\(\blacksquare\)

  3. Describe any trends you see on your picture (think form, direction, strength).

Solution

I see:

  • form: more or less linear (no real suggestion of a curve,1 especially given the (considerable) scatter of the points)
  • direction: a downward trend
  • strength: weak to moderate (after all, there clearly is a downward trend; it just isn’t very strong).

Your opinions might differ (especially about strength), and that can be fine as long as you have given some thought to what you see and have come up with adjectives that describe that.

\(\blacksquare\)

  4. Draw a scatterplot of fruit and vegetable consumption vs rate of college graduation. Describe what you see on your scatterplot (form, direction, strength).

Solution

Copy the code, change the variable name:

ggplot(brfss, aes(y = FruitVeg5, x = EdCollege)) + geom_point() +
  geom_smooth()
`geom_smooth()` using method = 'loess' and formula = 'y ~ x'

Here, I see:

  • form: more or less linear. The curve at the bottom end seems to be caused by that one state that has an unusually high FruitVeg5 value (over 25) despite having the second-lowest EdCollege value. So I am not going to worry about that.2 You might also note a “kink” in the trend at an EdCollege value around 35. In that case, you might say that the best description of the form of the trend is two straight lines glued together, rather than a curve.
  • direction: an upward trend.
  • strength: I would call this a moderately strong trend: a bit stronger than the other one, since the points seem to be overall closer to the trend. (This is a judgement call.)

\(\blacksquare\)

  5. Run a regression predicting the percent of adults who eat 5 servings of fruits and vegetables daily from EdCollege and SmokeEveryday, and display the results.

Solution

This:

brfss.1 <- lm(FruitVeg5 ~ EdCollege + SmokeEveryday, data = brfss)
summary(brfss.1)

Call:
lm(formula = FruitVeg5 ~ EdCollege + SmokeEveryday, data = brfss)

Residuals:
    Min      1Q  Median      3Q     Max 
-6.5881 -1.4681  0.3668  1.8217  6.5814 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)    22.1165     5.9015   3.748 0.000497 ***
EdCollege       0.2351     0.1044   2.253 0.029066 *  
SmokeEveryday  -0.4359     0.2033  -2.144 0.037371 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.747 on 46 degrees of freedom
Multiple R-squared:  0.4266,    Adjusted R-squared:  0.4017 
F-statistic: 17.11 on 2 and 46 DF,  p-value: 2.783e-06

Or this, if you prefer:

glance(brfss.1)
tidy(brfss.1)

\(\blacksquare\)

  6. Look at the two numbers in the Estimate column, not including the intercept. Do their signs, positive or negative, make sense in the context of the data, based on what you know or can guess? Explain briefly.

Solution

The slope for EdCollege is positive. This means that a state with a higher proportion of college graduates will also tend to have a higher proportion of people who eat their fruit and veg, holding smoking constant. This makes sense because you would expect people who have completed college to be more educated in general, and you’d expect such people to know about the importance of a good diet.

The slope for SmokeEveryday is negative. This means that a state with more smokers will tend to have fewer people eating fruits and vegetables. This is a direct health link: people who eat fruits and vegetables tend to care about their health, and people who smoke tend not to, so in the data we see states whose inhabitants are, on the whole, either more healthy or less healthy.

\(\blacksquare\)

  7. What does the last number in the rightmost column of the regression output tell us? Does this make sense in the light of graphs you have drawn? Explain briefly.

Solution

On my output, that number is 0.037. This is the P-value for testing whether the slope for SmokeEveryday is zero or not zero, or, equivalently, whether SmokeEveryday has anything to add to a regression that already contains EdCollege. (Or, also equivalently, whether SmokeEveryday could be removed from this regression.) This P-value is less than 0.05 and thus significant, and so the slope is definitely not zero (or, the explanatory variable does have something to add, or should not be removed.)

I was hoping that you would enter the explanatory variables in the same order that I did. If you had them the other way around, you will be referring to the other variable and the other plot you drew.

Think back to the adjective you used earlier to describe the strength of the relationship on your scatterplot, “weak” or “moderate” or something like that. The relationship didn’t look especially convincing, but the P-value says that it is more convincing than chance. (If there were no relationship, it would look more random than this.) You are entitled to be surprised that the relationship is significant, given the appearance of the scatterplot.
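
Extra: if you would rather pull that one number out of the output than read it off the display, tidy from broom will do it:

tidy(brfss.1) %>% filter(term == "SmokeEveryday")

The p.value column of this one-row table is the P-value we have been talking about.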

\(\blacksquare\)

  8. Draw a complete set of residual plots for this regression (four plots). Do you have any concerns? Explain briefly.

Solution

Residuals against fitted values, normal quantile plot of residuals, residuals against each of the two explanatory variables.

I’m going to do the augment thing first, to avoid warnings about fortify:

brfss.1a <- augment(brfss.1)

Residuals against fitted values:

ggplot(brfss.1a, aes(x = .fitted, y = .resid)) + geom_point()

Pretty much random. That point top left with the biggest (most positive) residual is the only one to be at all worried about.

Normal quantile plot of residuals:

ggplot(brfss.1a, aes(sample = .resid)) + stat_qq() + stat_qq_line()

On the line, really, all the way along. I wouldn’t stress about those three most negative residuals; they are really not far below the line. Also: that most positive residual, which shows up on the other plots as “unusually large”, is indeed quite a bit bigger than the second largest one, but it is close to the line here, so it is pretty much the size you’d expect the largest residual to be (and thus ought not to be a big concern).

Residuals against explanatory variables. This is the point where you need to use augment if you didn’t before. The best way is the pivot-longer way to get the plots all at once:

brfss.1a %>% 
  pivot_longer(EdCollege:SmokeEveryday, names_to = "xname", values_to = "xval") %>% 
  ggplot(aes(x = xval, y = .resid)) + geom_point() + 
  facet_wrap(~ xname, scales = "free")

Be prepared for embarrassingly stupid plots if you forget the scales (as I did).

Alternatively, plot one at a time (not so bad here because there are only two explanatory variables):

ggplot(brfss.1a, aes(x = EdCollege, y = .resid)) + geom_point()

ggplot(brfss.1a, aes(x = SmokeEveryday, y = .resid)) + geom_point()

These are not bad. You could choose to be bothered by that point top left on the EdCollege plot and top right on the SmokeEveryday plot, which seems to have an unusually large (positive) residual.

I would declare myself to be happy overall. There are no systematic problems, such as graphs that are completely the wrong shape, and the normal quantile plot indicates that the “large” positive residual is not actually too large.

Extra: of course, I wanted to find out which state had that big residual. People in that state were more likely to eat the five fruits and vegetables than you would have expected from knowing how many of them smoked (a lot) or had a college education (not many):

brfss.1a %>% 
  mutate(state_name = brfss$State) %>% 
  slice_max(.resid, n = 1) %>% 
  select(state_name, everything())

Tennessee. 26.4% were observed to eat their five fruits and vegetables, but the prediction is only 19.8%. Maybe there are other variables on which this state is unusual. The one that jumps out at me is how few people aged 18–24 there are. I’d have to do some digging to find out what those other variables actually are.

Code-wise: I realized I needed to add the state names (getting them from the original data, since brfss.1a didn’t have them), then I grabbed the one largest (most positive) residual, then I displayed everything with the state name first. (everything() in this context means “everything else”.)
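
If you wanted to start the digging I mentioned above, sorting the states by that Age18_24 column (one of the ones we have otherwise been ignoring) would show where Tennessee sits:

brfss %>% 
  select(State, Age18_24) %>% 
  arrange(Age18_24)

If my reading of the output is right, Tennessee should be at or near the top of this list.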

If you want to investigate yourself, here is the survey website, and the 2019 survey questionnaire. The data are collected and recorded at an individual level (with nothing identifying the person who responded), and then aggregated and summarized to produce the data that you saw. The textbook I got the data from was published in 2012, so the data would come from an earlier questionnaire.

\(\blacksquare\)

Construction projects

How long does it take to complete a large construction project? This might depend on a number of things. In one study of this issue, fifteen construction projects were investigated (as a random sample of “all possible construction projects”). Five variables were measured:

  • time taken to complete the project (days)
  • size of the contract (thousands of dollars)
  • number of work days affected by bad weather
  • whether or not there was a worker’s strike during the construction (1 = yes, 0 = no)
  • number of subcontractors involved in the project.

Subcontractors are people like electricians, plumbers, and so on, that are hired by the company overseeing the whole project to complete specific jobs. A large project might have a number of subcontractors coming in at different times to do parts of the work.

The data are in http://ritsokiguess.site/datafiles/construction.csv.

  1. Read in and display (some of) the data.

Solution

The usual for a .csv:

my_url <- "http://ritsokiguess.site/datafiles/construction.csv"
construction <- read_csv(my_url)
Rows: 15 Columns: 5
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
dbl (5): completion_time, size, bad_weather, strike, subcontractors

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
construction

15 rows (projects) and 5 (quantitative) variables. (strike is really categorical, but we are treating it as quantitative here, which is all right because it has only two categories.)

\(\blacksquare\)

  2. Fit a suitable regression, predicting the response variable from everything else, and display the results.

Solution

Predict completion time from everything else:

construction.1 <- 
  lm(completion_time ~ size + bad_weather + strike + subcontractors, 
     data = construction)
summary(construction.1)

Call:
lm(formula = completion_time ~ size + bad_weather + strike + 
    subcontractors, data = construction)

Residuals:
     Min       1Q   Median       3Q      Max 
-17.6284  -5.2983   0.1322   5.7912  17.6356 

Coefficients:
                Estimate Std. Error t value Pr(>|t|)  
(Intercept)    -1.720891  11.647578  -0.148   0.8855  
size           -0.007873   0.006226  -1.265   0.2347  
bad_weather     0.670164   0.999093   0.671   0.5176  
strike         27.713734  11.363613   2.439   0.0349 *
subcontractors  3.534685   1.933829   1.828   0.0975 .
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 11.84 on 10 degrees of freedom
Multiple R-squared:  0.8469,    Adjusted R-squared:  0.7857 
F-statistic: 13.83 on 4 and 10 DF,  p-value: 0.0004398

\(\blacksquare\)

  3. Build a good model for predicting completion time, using backward elimination with \(\alpha = 0.10\). Describe your process.

Solution

This means to eliminate useless explanatory variables, starting with the least significant (but leaving the intercept in, because that always stays). You’ll see at the end why I said to use \(\alpha\) of 0.10 rather than the usual 0.05.

The first step is to remove the explanatory variable with the largest P-value, which is bad_weather. There are two ways to fit a model that is a small change from a previous one. The way that probably occurred to you first is copy, paste and edit:3

construction.2a <- 
  lm(completion_time ~ size + strike + subcontractors, 
     data = construction)
summary(construction.2a)

Call:
lm(formula = completion_time ~ size + strike + subcontractors, 
    data = construction)

Residuals:
    Min      1Q  Median      3Q     Max 
-16.620  -6.707  -1.041   4.628  21.424 

Coefficients:
                Estimate Std. Error t value Pr(>|t|)  
(Intercept)     0.481240  10.892336   0.044   0.9656  
size           -0.007178   0.005983  -1.200   0.2555  
strike         29.137658  10.880871   2.678   0.0215 *
subcontractors  3.964190   1.778536   2.229   0.0476 *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 11.54 on 11 degrees of freedom
Multiple R-squared:   0.84, Adjusted R-squared:  0.7964 
F-statistic: 19.26 on 3 and 11 DF,  p-value: 0.0001099

The other way, which I think is better,4 is to use update, which takes a model to update, and some instructions for how to update it:

construction.2 <- update(construction.1, . ~ . - bad_weather)
summary(construction.2)

Call:
lm(formula = completion_time ~ size + strike + subcontractors, 
    data = construction)

Residuals:
    Min      1Q  Median      3Q     Max 
-16.620  -6.707  -1.041   4.628  21.424 

Coefficients:
                Estimate Std. Error t value Pr(>|t|)  
(Intercept)     0.481240  10.892336   0.044   0.9656  
size           -0.007178   0.005983  -1.200   0.2555  
strike         29.137658  10.880871   2.678   0.0215 *
subcontractors  3.964190   1.778536   2.229   0.0476 *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 11.54 on 11 degrees of freedom
Multiple R-squared:   0.84, Adjusted R-squared:  0.7964 
F-statistic: 19.26 on 3 and 11 DF,  p-value: 0.0001099

The idea is that someone reading your code will immediately see that this is taking a (large) model and making a small change to it. On the other hand, unless you explicitly describe what is going on, the copy-paste-edit approach will make your reader wonder “how is construction.2a different from construction.1 again?” and they will have to go through and check how the two models are different. This is work that you need to be doing, not your reader.

Code-wise, the second input to update says “keep the response variable the same, and keep all the explanatory variables the same except take out bad_weather”.

Having gotten to this point, look again to see that size is still not significant, so that can come out as well (but it looks as if the other two will have to stay):

construction.3 <- update(construction.2, . ~ . - size)
summary(construction.3)

Call:
lm(formula = completion_time ~ strike + subcontractors, data = construction)

Residuals:
     Min       1Q   Median       3Q      Max 
-23.1842  -5.6759   0.8662   5.8523  19.8045 

Coefficients:
               Estimate Std. Error t value Pr(>|t|)  
(Intercept)       5.168     10.352   0.499   0.6267  
strike           29.163     11.078   2.633   0.0219 *
subcontractors    2.989      1.610   1.856   0.0882 .
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 11.75 on 12 degrees of freedom
Multiple R-squared:  0.8191,    Adjusted R-squared:  0.789 
F-statistic: 27.17 on 2 and 12 DF,  p-value: 3.502e-05

At \(\alpha = 0.10\), this is where we stop, because strike and subcontractors are both significant (at this level). (This is why I put the value in for \(\alpha\); otherwise you would need to take out subcontractors as well.)

I actually copied, pasted and edited the update line here.

Extra 1: This could also be done using step, but since that is an automatic process, you will need to describe what is going on:

construction.4 <- step(construction.1, direction = "backward", test = "F")
Start:  AIC=78.05
completion_time ~ size + bad_weather + strike + subcontractors

                 Df Sum of Sq    RSS    AIC F value  Pr(>F)  
- bad_weather     1     63.04 1464.2 76.715  0.4499 0.51756  
<none>                        1401.1 78.055                  
- size            1    224.07 1625.2 78.280  1.5993 0.23468  
- subcontractors  1    468.10 1869.2 80.378  3.3409 0.09752 .
- strike          1    833.35 2234.5 83.056  5.9478 0.03492 *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Step:  AIC=76.71
completion_time ~ size + strike + subcontractors

                 Df Sum of Sq    RSS    AIC F value  Pr(>F)  
- size            1    191.54 1655.7 76.559  1.4391 0.25549  
<none>                        1464.2 76.715                  
- subcontractors  1    661.27 2125.4 80.305  4.9680 0.04762 *
- strike          1    954.50 2418.7 82.244  7.1710 0.02149 *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Step:  AIC=76.56
completion_time ~ strike + subcontractors

                 Df Sum of Sq    RSS    AIC F value  Pr(>F)  
<none>                        1655.7 76.559                  
- subcontractors  1    475.21 2130.9 78.344  3.4442 0.08819 .
- strike          1    956.17 2611.9 81.397  6.9301 0.02187 *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

First bad_weather is removed, and then size. At this point, it is best to remove nothing else, leaving subcontractors and strike.

If you saved the result of step, you’ll get the final fitted model, which you can then look at:

summary(construction.4)

Call:
lm(formula = completion_time ~ strike + subcontractors, data = construction)

Residuals:
     Min       1Q   Median       3Q      Max 
-23.1842  -5.6759   0.8662   5.8523  19.8045 

Coefficients:
               Estimate Std. Error t value Pr(>|t|)  
(Intercept)       5.168     10.352   0.499   0.6267  
strike           29.163     11.078   2.633   0.0219 *
subcontractors    2.989      1.610   1.856   0.0882 .
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 11.75 on 12 degrees of freedom
Multiple R-squared:  0.8191,    Adjusted R-squared:  0.789 
F-statistic: 27.17 on 2 and 12 DF,  p-value: 3.502e-05

the same as before.

Extra 2: Another way to go is to look at the output from construction.1 again:

summary(construction.1)

Call:
lm(formula = completion_time ~ size + bad_weather + strike + 
    subcontractors, data = construction)

Residuals:
     Min       1Q   Median       3Q      Max 
-17.6284  -5.2983   0.1322   5.7912  17.6356 

Coefficients:
                Estimate Std. Error t value Pr(>|t|)  
(Intercept)    -1.720891  11.647578  -0.148   0.8855  
size           -0.007873   0.006226  -1.265   0.2347  
bad_weather     0.670164   0.999093   0.671   0.5176  
strike         27.713734  11.363613   2.439   0.0349 *
subcontractors  3.534685   1.933829   1.828   0.0975 .
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 11.84 on 10 degrees of freedom
Multiple R-squared:  0.8469,    Adjusted R-squared:  0.7857 
F-statistic: 13.83 on 4 and 10 DF,  p-value: 0.0004398

and say “well, it looks as if I can remove both size and bad_weather from this”. This is dangerous, though, because the P-values only apply to the removal of one explanatory variable from the model you are looking at. You can do it, but you must follow up with a test that checks whether removing both of those explanatory variables was OK.

The first step is to fit the model without the two explanatory variables you want to remove:

construction.5 <- update(construction.1, . ~ . - size - bad_weather)
# summary(construction.5)

The obligatory next step is to compare the fits of the two models using anova, smaller model first:

anova(construction.5, construction.1, test = "F")

This is not significant, so the two models fit equally well, and therefore we should go with the smaller, simpler model. That is to say, in this case it was all right to take out both explanatory variables, but we needed to do this test to be sure.

If the anova result is significant, that means that the smaller model fits significantly worse than the bigger one, and that taking out all the things you took out was a mistake. If this happens to you, you need to go back and remove fewer explanatory variables; if you had gone with one at a time, you would probably have found that one of them became significant during the process, and you would have ended up not removing that one.

By way of sanity-checking, the R-squared for the model with everything in (my construction.1) was 84.7%, and for the model with only strike and subcontractors was 81.9%. Taking out those two explanatory variables made only a small difference to R-squared, so it was probably a good idea (but you need to do the test to be sure).
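
If you would like those two R-squared values side by side, rather than scrolling back through the summary output, glance from broom is one way to line them up:

bind_rows(full = glance(construction.1), small = glance(construction.5), .id = "model") %>% 
  select(model, r.squared, adj.r.squared)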

Extra 3: This is where we got to before:

summary(construction.3)

Call:
lm(formula = completion_time ~ strike + subcontractors, data = construction)

Residuals:
     Min       1Q   Median       3Q      Max 
-23.1842  -5.6759   0.8662   5.8523  19.8045 

Coefficients:
               Estimate Std. Error t value Pr(>|t|)  
(Intercept)       5.168     10.352   0.499   0.6267  
strike           29.163     11.078   2.633   0.0219 *
subcontractors    2.989      1.610   1.856   0.0882 .
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 11.75 on 12 degrees of freedom
Multiple R-squared:  0.8191,    Adjusted R-squared:  0.789 
F-statistic: 27.17 on 2 and 12 DF,  p-value: 3.502e-05

What happens if we take out subcontractors now, as we would if working at \(\alpha = 0.05\)?

construction.6 <- update(construction.3, . ~ . - subcontractors)
summary(construction.6)

Call:
lm(formula = completion_time ~ strike, data = construction)

Residuals:
   Min     1Q Median     3Q    Max 
-19.00  -8.55   1.00   6.45  21.90 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   23.100      4.049   5.706 7.22e-05 ***
strike        45.900      7.012   6.545 1.87e-05 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 12.8 on 13 degrees of freedom
Multiple R-squared:  0.7672,    Adjusted R-squared:  0.7493 
F-statistic: 42.84 on 1 and 13 DF,  p-value: 1.867e-05

R-squared has gone down by about 5 percentage points, which might be starting to concern you as being too much. Also, adjusted R-squared has gone down. So maybe removing subcontractors, even though it wasn’t quite significant by the usual standards, was a bit too much. (At least strike is now strongly significant, and there is no doubt that this has to stay.)

\(\blacksquare\)

  4. For your best model, obtain a full set of residual plots. This means:
  • residuals vs fitted values
  • normal quantile plot of residuals
  • residuals vs each of the explanatory variables.

Solution

I start by running augment and using the output from that for all the plots, to save myself getting a warning:

construction.3a <- augment(construction.3, data = construction)
construction.3a

This time I am using the data = thing to make sure I have all the variables from the original dataset, including the ones I removed (because I want to plot the residuals against them as well). I displayed the augmented dataframe so you can see what it has in it.

Then:

ggplot(construction.3a, aes(x = .fitted, y = .resid)) + geom_point()

ggplot(construction.3a, aes(sample = .resid)) + stat_qq() + stat_qq_line()

construction.3a %>%  
  pivot_longer(size:subcontractors, 
               names_to = "xnames", values_to = "xvals") %>%
  ggplot(aes(x = xvals, y = .resid)) + geom_point() +
  facet_wrap( ~ xnames, scales = "free")

If you prefer, plot the residuals against the \(x\)s one by one, but the facets idea is best. (You could also do it one by one if you only had two explanatory variables, but if you have more than that, the pivot-longer approach is definitely better.)
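
For the record, the one-at-a-time version repeats the same plot once per explanatory variable, for example:

ggplot(construction.3a, aes(x = size, y = .resid)) + geom_point()

ggplot(construction.3a, aes(x = bad_weather, y = .resid)) + geom_point()

and so on for strike and subcontractors.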

\(\blacksquare\)

  5. Comment briefly on your plots, and how they support the regression being basically satisfactory. You may assume that all the data values were correctly recorded.

Solution

Remember that having two clouds of points is not in itself a bad thing; what matters is whether changing the value of the variable on the \(x\)-axis tells you anything about what the residual is going to be. Another way of saying that the residuals are random is that they are unpredictable given anything else.

  • There are about five fitted values that are clearly larger than the others, but they have residuals that average out to zero and have about the same spread as the residuals for the other fitted values. Said differently, there are two random clouds of residuals, with no pattern, either within each cloud or between the two clouds. I don’t really think this is enough for “fanning out”.

  • The points on the normal quantile plot are close to the line. Only the most negative residual is at all off the line (it is more negative than expected). (This is not really an “outlier” residual, since it doesn’t make sense to talk about residuals as having outliers, though it might be a residual that indicates a possible outlying data value. More on this later.)

  • in the plot against bad_weather, my only concern is the largest residual that happens to be top right on this plot: the job on which the actual construction time was much longer than predicted was when there was the most bad weather. The other points seem more or less random. You might imagine that on construction jobs, a little bad weather is allowed for in the planning, but when there is a lot, it can throw things badly off.

  • size has a strange pattern: small jobs have a desirable pattern of residuals, but for the six or so biggest jobs, the residuals increase with size and then sharply decrease. The very biggest job actually took a lot less time than expected.

  • strike only takes two values 0 and 1, but the residuals attached to each of these values seem to be random for each, averaging out to zero with similar spreads.

  • the pattern of residuals against subcontractors has an odd shape: there may be a curved trend in all the points except the one bottom right, and a possible outlier (the point bottom right). Without the outlier, I would try a squared term in subcontractors, but with it, it is difficult to know what to do, and difficult to say that something is more wrong than chance. The last sentence of the question rules out the possibility of removing the observation with the very negative residual: it is correct, so we have to include it in the model.

That observation with the very negative residual is the apparent outlier on the last plot, the one that is most off the line in the normal quantile plot, and is visible on the other plots as well. I told you that this outlier (and all the other observations) were correctly recorded, so there is no reason to remove this observation.

These last plots indicate that the regression is not really satisfactory, and that we should consider transforming some of the explanatory variables, or maybe seeing why those strange observations, such as the one with a very negative residual, are the way they are.

Extra: which observation is that one with the very negative residual?

construction.3a %>% 
  slice_min(.resid)

How does that compare with the rest of the data?

summary(construction)
 completion_time      size         bad_weather         strike      
 Min.   : 7.0    Min.   :  50.0   Min.   : 3.000   Min.   :0.0000  
 1st Qu.:18.0    1st Qu.:  72.5   1st Qu.: 6.000   1st Qu.:0.0000  
 Median :30.0    Median : 100.0   Median : 8.000   Median :0.0000  
 Mean   :38.4    Mean   : 415.0   Mean   : 9.467   Mean   :0.3333  
 3rd Qu.:55.0    3rd Qu.: 400.0   3rd Qu.:13.000   3rd Qu.:1.0000  
 Max.   :90.0    Max.   :2600.0   Max.   :20.000   Max.   :1.0000  
 subcontractors  
 Min.   : 3.000  
 1st Qu.: 5.500  
 Median : 8.000  
 Mean   : 7.867  
 3rd Qu.:10.500  
 Max.   :13.000  

This is a big project: the size and number of subcontractors are the largest of all. There was also a strike during this project, which is important because that was in our best model. The negative residual means that despite all this, the project was completed in less time than predicted (50 days vs. predicted 73 days). There might have been something else about this project that had an impact on the completion time (that we don’t know about), for example the people running this project could have paid the subcontractors extra to complete their work sooner (but still, we hope, safely and in compliance with building codes).

\(\blacksquare\)

Forced expiratory volume

One way to measure how well someone is breathing is to measure their “forced expiratory volume” (FEV), which is how much air (in litres) you can expel from your lungs in one second. If the FEV is too low, this may indicate difficulties in breathing. A doctor wanted to see whether children whose parents smoked at home had difficulties breathing (as measured by the FEV). The data are in http://ritsokiguess.site/datafiles/fev.csv. The doctor also recorded the child’s age (in years) and height (in inches), as well as the child’s gender (here male or female), along with whether or not the child had been exposed to smoking in the home (in the column Smoke). 654 children were observed.

  1. Read in and display (some of) the data.

Nothing new here:

my_url <- "http://ritsokiguess.site/datafiles/fev.csv"
fev <- read_csv(my_url)
Rows: 654 Columns: 5
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (2): Gender, Smoke
dbl (3): Age, FEV, Ht

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
fev

I should probably have chosen a different name for my dataframe. However, R usually distinguishes between uppercase and lowercase, so it is actually all right to have a dataframe called fev and a column in it called FEV. If rather confusing.

There are 654 rows (children) and the columns as promised.

  2. Fit a regression predicting forced expiratory volume from all the other variables. Why should you run drop1? Do so for this model.

fev.1 <- lm(FEV ~ Age + Ht + Gender + Smoke, data = fev)
drop1(fev.1, test = "F")

We should run drop1 rather than starting with summary because some of our explanatory variables are categorical, and this will tell us whether any of them can be removed as a whole, whereas summary only tells us about the effects of particular levels of the categorical variable.

  3. What does your drop1 output tell you? In particular, does it address the doctor’s concern?

My drop1 is a tidy-style table, so all the P-values (in the last column) are in scientific notation. Yours might look more like the summary output does.

The P-values for age, height, and gender are all much less than 0.05, and so these cannot be removed from the model. The P-value for Smoke, however, is 0.14, not significant. This says that even after you account for age, height, and gender, whether the child was exposed to smoke at home does not help you predict FEV, and so the doctor’s suspicion about the effects of smoking is not supported by these data.

  4. Look at the summary output for your model, and interpret the Estimates for Ht and Gender.

Get the summary first:

summary(fev.1)

Call:
lm(formula = FEV ~ Age + Ht + Gender + Smoke, data = fev)

Residuals:
     Min       1Q   Median       3Q      Max 
-1.37656 -0.25033  0.00894  0.25588  1.92047 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) -4.456974   0.222839 -20.001  < 2e-16 ***
Age          0.065509   0.009489   6.904 1.21e-11 ***
Ht           0.104199   0.004758  21.901  < 2e-16 ***
Gendermale   0.157103   0.033207   4.731 2.74e-06 ***
Smokeyes    -0.087246   0.059254  -1.472    0.141    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.4122 on 649 degrees of freedom
Multiple R-squared:  0.7754,    Adjusted R-squared:  0.774 
F-statistic:   560 on 4 and 649 DF,  p-value: < 2.2e-16

  • Ht is a quantitative variable, so the interpretation is as usual: an increase of 1 inch in height, with everything else remaining the same, is predicted to result in an increase of 0.104 litres in FEV. (This makes sense because a taller child will probably have bigger lungs.)

  • Gender is a categorical variable. This means that the interpretation is of how the category shown (male) compares with the baseline category (female, the first one alphabetically). Specifically, a boy is predicted to have a FEV of 0.157 greater than a girl of the same age, height and parental smoking status.

Extra: if you look carefully, you’ll see that the P-values for Gender and Smoke are the same in the summary and drop1 tables. This is because we are in the special situation of both our categorical variables having only two categories, and so a test of the second category vs. the baseline is the same as a test of the significance of the whole categorical variable. That said, I think it is still easier to remember that if you have any categorical explanatory variables, you assess significance by looking at drop1 rather than summary.

The reason for looking at summary is what we just did: to understand the significant effects (in this case, only the summary table tells us that there is a positive association between age and FEV, and also between height and FEV, and that boys have greater FEV than girls, all else equal).
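
As a small check on the Gender interpretation, you could predict FEV for two hypothetical children who are identical apart from gender. This is only a sketch: I am assuming the categories are coded female/male and no/yes, which is what the Gendermale and Smokeyes labels in the summary suggest:

new <- tibble(Age = 10, Ht = 60, Gender = c("female", "male"), Smoke = "no")
predict(fev.1, new)

The two predictions should differ by exactly the Gendermale Estimate, 0.157 litres.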

  5. Make a complete set of residual plots for this model. Do you have any concerns? (Hint: the procedure you know doesn’t like a mixture of quantitative and categorical variables, because it’s trying to put text and numbers into the same column. Do the quantitative variables first, and then do the categorical ones.)

I’m doing the augment thing first again. All the explanatory variables are in this model (we didn’t remove Smoke), so the simple augment is fine:

fev.1a <- augment(fev.1)
fev.1a

I’m displaying the result so that I do the pivots-longer5 correctly in a moment.

ggplot(fev.1a, aes(x = .fitted, y = .resid)) + geom_point()

That looks like a big fan-out problem: when the fitted value is larger, the predictions are less accurate. Real fan-out problems are usually about as obvious as this one is.

ggplot(fev.1a, aes(sample = .resid)) + stat_qq() + stat_qq_line()

Long tails, slightly.

Residuals vs \(x\)s, quantitative first:6

fev.1a %>% 
  pivot_longer(Age:Ht, names_to = "xname", values_to = "xval") %>%
  ggplot(aes(x = xval, y = .resid)) + geom_point() +
  facet_wrap(~ xname, scales = "free")

These both have an element of fanning out as well. It might be that fixing up the response will sort these out too.

Then categorical:

fev.1a %>% 
  pivot_longer(Gender:Smoke, names_to = "xname", values_to = "xval") %>%
  ggplot(aes(x = xval, y = .resid)) + geom_point() +
  facet_wrap(~ xname, scales = "free")

With these, you are looking for vertical strips of points that average out to zero, with about the same spread for all categories. With so many observations, it’s not obvious whether we have that or not, so an alternative is to use boxplots for the categorical variables, like this:7

fev.1a %>% 
  pivot_longer(Gender:Smoke, names_to = "xname", values_to = "xval") %>%
  ggplot(aes(x = xval, y = .resid)) + geom_boxplot() +
  facet_wrap(~ xname, scales = "free")

Having drawn those, I’m still not sure. Maybe the male residuals are more variable than the female ones. In the Smoke facet, Yes has a taller box, but No has outliers at both ends.

  6. Investigate a transformation of FEV. Are the results consistent with your residual plots?

The fanning-out on the plot of residuals vs fitted values (and elsewhere) suggests that the larger values need to be brought down, which implies that a transformation like log or square root might be helpful. You could propose one of those, but better is to fire up Box-Cox to see what the possibilities are:

boxcox(FEV ~ Age + Ht + Gender + Smoke, data = fev)

This strongly suggests a transformation of log (that is, that \(\lambda = 0\)). Doing nothing is ruled out (1 is well outside the confidence interval) and also square root is not supported by the data (0.5 is also outside the confidence interval). We have a lot of observations, so this plot is telling us pretty clearly what to do.

The fanning out on the residual plots suggests that higher values are being predicted less accurately, so we need to bring the higher values down and make them closer together, which is what a log transformation will do.
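
Extra: if the confidence interval is hard to read on the default plot, boxcox will accept a narrower range of \(\lambda\) values to zoom in on, something like:

boxcox(FEV ~ Age + Ht + Gender + Smoke, data = fev, lambda = seq(-0.5, 0.5, 0.05))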

  7. Fit a regression suggested by the results of your previous question, and display the drop1 and summary output. Has anything changed from your previous work? Explain briefly.

Replace FEV with the log of FEV, and copy-paste-edit previous work:

fev.2 <- lm(log(FEV) ~ Age + Ht + Gender + Smoke, data = fev)
drop1(fev.2, test = "F")
summary(fev.2)

Call:
lm(formula = log(FEV) ~ Age + Ht + Gender + Smoke, data = fev)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.63278 -0.08657  0.01146  0.09540  0.40701 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) -1.943998   0.078639 -24.721  < 2e-16 ***
Age          0.023387   0.003348   6.984  7.1e-12 ***
Ht           0.042796   0.001679  25.489  < 2e-16 ***
Gendermale   0.029319   0.011719   2.502   0.0126 *  
Smokeyes    -0.046068   0.020910  -2.203   0.0279 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.1455 on 649 degrees of freedom
Multiple R-squared:  0.8106,    Adjusted R-squared:  0.8095 
F-statistic: 694.6 on 4 and 649 DF,  p-value: < 2.2e-16

The smoking variable is now significant, with a P-value of 0.028. Looking at the summary, it has a negative Estimate, which means that children who were exposed to smoking at home have a lower log-FEV compared to children who were not (baseline), all else equal. So there is an effect of smoking after all, in the direction the doctor suspected, and the doctor is (finally) vindicated.
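
One more way to read that Estimate: because the response is log(FEV), exponentiating the Smoke coefficient turns it into a multiplicative effect on FEV itself:

exp(-0.046068)

This is about 0.955, so a child exposed to smoking at home is predicted to have an FEV about 4.5% lower than an otherwise-identical child who was not exposed.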

Extra: everything is now significant, so a graph is liable to be an oversimplification, but I wanted to show you a graph that uses the marginaleffects package, which we will see again in D29:

library(marginaleffects)
plot_predictions(fev.2, condition = c("Ht", "Smoke"))

These predictions are actually of log-FEV, which the vertical axis should show. I have shown how the predictions depend on height and smoking. The plot uses an average value for age, and the more common of the two genders:

fev %>% summarize(mean_age = mean(Age))
fev %>% count(Gender)

so boys, but the picture is similar for girls. The oversimplification is that taller children are almost certainly older, not shown here.
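
That last claim is easy enough to check from the data, if you want to:

fev %>% summarize(r = cor(Age, Ht))

I would expect this correlation to be fairly high.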

The plot shows that log-FEV goes up sharply with the child’s height (the taller they are, the bigger their lungs), but there is also a small but clear effect of being exposed to smoking: children not exposed are predicted to have a slightly higher log-FEV than those that were exposed, and that small effect is significant.

This graph goes to show that the effect of height (and presumably also of age) is very large, and the effect of being exposed to smoking could easily be hidden by it. The value of looking at a multiple regression is once again that we see the effect of each variable over and above the effect of the others; being exposed to smoking has a definite effect over and above the effects of the other things.

Extra extra: it would also be a good idea to check that the residual plots are now satisfactory, but it is late, I am tired, and I have to teach you guys tomorrow.

(later, as in Thursday afternoon) now I realize that there would be a lot of copying and pasting, and a small amount of editing, so let’s go for it:

fev.2a <- augment(fev.2)
fev.2a
ggplot(fev.2a, aes(x = .fitted, y = .resid)) + geom_point()

As random as you could want.

ggplot(fev.2a, aes(sample = .resid)) + stat_qq() + stat_qq_line()

Left-skewed, or a long lower tail. There are a few too many residuals that are too negative, now.

Residuals vs \(x\)s, quantitative first:

fev.2a %>% 
  pivot_longer(Age:Ht, names_to = "xname", values_to = "xval") %>%
  ggplot(aes(x = xval, y = .resid)) + geom_point() +
  facet_wrap(~ xname, scales = "free")

These are now appropriately random. Ages are only whole years, hence the vertical strips.

Then categorical, going straight to boxplots:

fev.2a %>% 
  pivot_longer(Gender:Smoke, names_to = "xname", values_to = "xval") %>%
  ggplot(aes(x = xval, y = .resid)) + geom_boxplot() +
  facet_wrap(~ xname, scales = "free")

These all average out close to zero and have more or less the same spread (as measured by the boxes). On these, the boxes within the same facet are the ones that need to be of about equal height. In this case, though, all four boxes look to have equal height. I am not too worried by the outliers; when you have a lot of observations, as here, boxplots will tend to show a lot of outliers that are not really outliers.8 There are more residuals that are negative boxplot-outliers than positive ones, which echoes what we saw in the normal quantile plot of residuals.

Overall, I would say that things are much improved. Getting rid of the fanning-out was the major priority, and we have done that. Having residuals that are a bit left-skewed as a result is, to me, a small price to pay. I think I am happy enough with this, and therefore the revised decision that being exposed to smoking does have a negative effect on (log of) FEV, over and above the effects of the other explanatory variables, is the one we should stand by.

Footnotes

  1. You might think this bends, but with this much scatter you will have a very hard time demonstrating that a curve is better than a straight line.↩︎

  2. Basing a decision about anything on one observation is never a good idea.↩︎

  3. Typing this all out again would be way too much work.↩︎

  4. But I can’t remember whether I showed you this in lecture.↩︎

  5. The noun is “pivot”, while “longer” is a comparative adjective, and the thing that should be pluralized is the noun. Hence “pivots-longer”.↩︎

  6. I can tell I’m getting tired, because it took me about three goes to get this right.↩︎

  7. It’s actually only a tiny change.↩︎

  8. Mary Eleanor Spear and John Tukey were imagining boxplots being used for much smaller sample sizes than we have here, say sample sizes of 30.↩︎