STAC32 Assignment 8
Packages
library(tidyverse)
library(broom)
Cost of groceries
It is suspected that the more people there are in a household, the less the cost per person per week for groceries. To investigate this, a marketing institute randomly sampled 20 households, recording the number of people in the household and the cost per person of groceries for a week. The data are in http://ritsokiguess.site/datafiles/groceries.csv.
- (1 point) Read in and display (some of) the data.
To no-one’s surprise:
my_url <- "http://ritsokiguess.site/datafiles/groceries.csv"
groceries <- read_csv(my_url)
Rows: 20 Columns: 2
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
dbl (2): number, cost
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
groceries
Calling the dataframe groceries seems to make sense, but you can use another reasonable name of your choosing.
Extra: this is the Family dataset from the BSDA package. I renamed it for you, because the experimental units were really households rather than families (unless you think a 1-person household is a “family”).
- (2 points) Make an appropriate graph of the two variables in the dataframe. Do not add a trend (line or curve).
The two variables are both quantitative, so a scatterplot. As ever with a scatterplot, think about which variable is explanatory and which is response. Here, cost of groceries per person is suspected to depend on the number of people in the household, so the former is the response and the latter is explanatory:
ggplot(groceries, aes(x = number, y = cost)) + geom_point()
Extra: the reason I said not to add a trend is that if you add a smooth trend, which would otherwise be best, you get this:
ggplot(groceries, aes(x = number, y = cost)) + geom_point() + geom_smooth()
`geom_smooth()` using method = 'loess' and formula = 'y ~ x'
Warning in simpleLoess(...): pseudoinverse used at 0.975
Warning in simpleLoess(...): neighborhood radius 2.025
Warning in simpleLoess(...): reciprocal condition number 3.9438e-17
Warning in simpleLoess(...): There are other near singularities as well. 1
Warning in predLoess(...): pseudoinverse used at 0.975
Warning in predLoess(...): neighborhood radius 2.025
Warning in predLoess(...): reciprocal condition number 3.9438e-17
Warning in predLoess(...): There are other near singularities as well. 1
It works, but you get a lot of warnings (that I didn’t want to confuse you with). I think this is because there are only six different values of number (you can only have a whole number of people in a household), and the smooth trend likes to have more than that to work with, which is where all the warnings are coming from.
Having said that, the trend that we get does look pretty straight (next question).
It would be an error to add a regression line to this graph, because when we drew this graph, we didn’t know what the relationship looked like. After answering the next question, we could then re-draw the graph with a straight line on it, if we were then happy that the relationship was linear. (Adding a straight line to a scatterplot where the trend might not be straight is also deceiving, because it conditions the mind to believe that the relationship actually is straight, whether it is or not: that is to say, it makes you think that the relationship is straighter than it actually is).
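Extra on the Extra: once you were happy (after the next question) that the trend is straight, re-drawing the graph with a line would use method = "lm" in geom_smooth. A sketch:
ggplot(groceries, aes(x = number, y = cost)) + geom_point() +
  geom_smooth(method = "lm", se = FALSE)
The se = FALSE suppresses the grey uncertainty envelope; leave it out if you want the envelope as well.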
- (3 points) Describe any relationship you see on your graph (hint: form, direction, strength). Does your graph support the marketing institute’s suspicion? Explain briefly.
- Form: linear, or very close to it. (It may not be quite clear that the linearity continues up to a household size of 6, but there is only one household that big, so the evidence against linearity is flimsy.)
- Direction: downward
- Strength: moderate, or moderate to strong, or something like that. (The downward trend is clear, but there is some variability in cost within households of the same number of people.)
Two points for getting something like those. The last point is for addressing the institute’s suspicion, along with how you know. They suspected that the cost per person would decrease as the household size increased, and the downward trend supports that.
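Extra: you can put a number on the strength with the correlation (not something you were asked for). Since the regression below has R-squared 0.7536, the correlation should come out around \(-\sqrt{0.7536} = -0.87\), negative because the trend goes downhill:
with(groceries, cor(number, cost))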
- (2 points) Fit a regression for predicting cost per person from household size, and display the results.
This:
groceries.1 <- lm(cost ~ number, data = groceries)
summary(groceries.1)
Call:
lm(formula = cost ~ number, data = groceries)
Residuals:
Min 1Q Median 3Q Max
-5.5556 -2.4000 -0.3333 3.0278 5.4444
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 88.6444 1.6762 52.883 < 2e-16 ***
number -4.0889 0.5511 -7.419 7.05e-07 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 3.202 on 18 degrees of freedom
Multiple R-squared: 0.7536, Adjusted R-squared: 0.7399
F-statistic: 55.04 on 1 and 18 DF, p-value: 7.051e-07
Give your fitted model a name that will remind you what it is. You could use the dataframe name and a number, the response variable name and a number, or some other system of your devising. Often, you will end up with several fitted models, and you want to be able to tell which one is which.
- (2 points) What do you learn from the P-value for number in your output from the previous question?
This is testing the null hypothesis that the slope is zero (the trend is flat) against an alternative that the slope is not zero (there is a real trend). The P-value is \(7.05 \times 10^{-7}\), much less than 0.05, so we reject this null in favour of the alternative and conclude that there is a relationship between cost per person and household size for all households (of which the ones we have are a sample).
Make sure you get to something about grocery costs and household size. “Reject the null hypothesis” by itself is worth 0.5 points if you are lucky. Remember that the purpose of a hypothesis test is to say something about a population, namely the population you think your sample was drawn from (which might be “all households” here).
This confirms that the suspicion the marketing institute had was actually correct, for households in general and not just the ones sampled.
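Extra: the P-value says there is a relationship; a confidence interval for the slope says how big it might be. You were not asked for one, but a sketch, using base R or broom (either works):
confint(groceries.1)
# or, since broom is loaded:
tidy(groceries.1, conf.int = TRUE)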
- (2 points) Obtain two plots that will help you assess the appropriateness of the regression.
This means to get plots of:
- residuals vs. fitted values
- normal quantile plot of residuals:
ggplot(groceries.1, aes(x = .fitted, y = .resid)) + geom_point()
Warning: `fortify(<lm>)` was deprecated in ggplot2 4.0.0.
ℹ Please use `broom::augment(<lm>)` instead.
ℹ The deprecated feature was likely used in the ggplot2 package.
Please report the issue at <https://github.com/tidyverse/ggplot2/issues>.
and
ggplot(groceries.1, aes(sample = .resid)) + stat_qq() + stat_qq_line()
- (2 points) Comment on the two plots you just obtained.
The plot of residuals against fitted values should be a random scatter of points, but it looks to me as if it “fans out”, that is, the predictions are less accurate when the fitted value is larger (that is, when the household size is smaller). This might be because averages based on a larger number of individuals (which is what the cost per person is) will tend to be estimated more accurately.
The normal quantile plot of residuals looks good; the only potential problem is short tails.
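Extra: the explanation just given also suggests a remedy: if cost per person is an average over number people, its variance should be proportional to 1 over number, which is the setting for weighted least squares with weights proportional to household size. A sketch, not something you were asked for (the more usual remedy, a transformation, is in the next Extra), and the name groceries.w is my choice:
groceries.w <- lm(cost ~ number, data = groceries, weights = number)
summary(groceries.w)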
Extra: the usual thing to ask you to do after this is to fix up any problems you see, but it’s not so obvious here. The fanning-out is usually helped by a transformation that brings the larger fitted values closer together, like log or square root. I thought I would begin with Box-Cox:
library(MASS, exclude = "select")
boxcox(cost ~ number, data = groceries, lambda = seq(-10, 2, 0.1))
I had to enlarge the scale so that you could see the whole picture: the transformation could be anything between log and power \(-7\)!
We can try log, on the basis that it should do the right kind of thing:
groceries.2 <- lm(log(cost) ~ number, data = groceries)
summary(groceries.2)
Call:
lm(formula = log(cost) ~ number, data = groceries)
Residuals:
Min 1Q Median 3Q Max
-0.06985 -0.02865 -0.00437 0.03374 0.06051
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 4.492683 0.020637 217.698 < 2e-16 ***
number -0.053384 0.006785 -7.867 3.11e-07 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.03942 on 18 degrees of freedom
Multiple R-squared: 0.7747, Adjusted R-squared: 0.7622
F-statistic: 61.9 on 1 and 18 DF, p-value: 3.107e-07
This is actually very similar to what we had before: similar R-squared, similar P-value for the slope, similar downward trend. But the important question is whether the residual plot fans out any less:
ggplot(groceries.2, aes(x = .fitted, y = .resid)) + geom_point()
I think this is an improvement, though maybe not as much of one as you would like. You could, if you wanted, also try reciprocal (power \(-1\)), and see whether you like that any better.
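If you do want to try the reciprocal, a sketch (I() makes lm compute 1/cost itself, and augment follows the advice in the earlier deprecation warning; the name groceries.3 is my choice):
groceries.3 <- lm(I(1/cost) ~ number, data = groceries)
augment(groceries.3) %>%
  ggplot(aes(x = .fitted, y = .resid)) + geom_point()
Bear in mind that the reciprocal flips the direction of the trend: a household with a high cost per person has a low value of 1/cost.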
Extra 2: the number of people in a household must be a whole number, and there are (usually) several households with the same number of people in them, so this, like the example on the worksheet, is also a candidate for a lack of fit test. We’ll do it on our first regression (without a transformation), though you could also do it after transforming cost. The procedure is the same as for the wear loss data on the worksheet:
library(alr3)
Loading required package: car
Loading required package: carData
Attaching package: 'car'
The following object is masked from 'package:dplyr':
recode
The following object is masked from 'package:purrr':
some
Attaching package: 'alr3'
The following object is masked from 'package:MASS':
forbes
pureErrorAnova(groceries.1)
There is no evidence at all of lack of fit here, with a P-value of 0.71. That is to say, the variability in cost is large enough (in households that have the same number of people) that there is no evidence for anything other than a straight-line model here. (This is sort-of consistent with our other Extra, in which we decided that it was very difficult to find a good transformation.)
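Extra on the Extra: if you don’t have alr3, the same lack-of-fit test can be put together by hand (a sketch): fit a model with number as a factor, which estimates a separate mean cost for each household size, then compare the two models with anova. The F-test that comes out is the lack-of-fit test; the name groceries.sat (for “saturated”) is my choice:
groceries.sat <- lm(cost ~ factor(number), data = groceries)
anova(groceries.1, groceries.sat)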
Employment growth
What other factors does the rate of employment growth depend on? In a 1994 study of the area around Hannover, Germany, an economist measured these variables:
- y: rate of employment growth (response, a percentage)
- PA: percent of people employed in production activities (such as manufacturing)
- GPA: growth in PA from one time Period to the next (percent)
- HS: percent of people employed in higher services (not directly involved in manufacturing something)
- GHS: growth in HS from one time Period to the next (percent)
- Period: time period: 1 (1979-1982), 2 (1983-1988), 3 (1989-1992). We treat this as quantitative.
These variables were measured for each of 21 Regions in the city of Hannover and the surrounding area. We do not use Region in our analysis.
The variable y can be positive (an increase in employment) or negative (a decrease). The data are in http://ritsokiguess.site/datafiles/hannover_growth.csv.
- (1 point) Read in and display (some of) the data.
As usual:
my_url <- "http://ritsokiguess.site/datafiles/hannover_growth.csv"
growth <- read_csv(my_url)
Rows: 63 Columns: 7
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
dbl (7): Region, PA, GPA, HS, GHS, y, Period
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
growth
This has the variables promised. There are three time periods, and the values of the variables were obtained once for each time period in each region, hence there are \(3 \times 21 = 63\) observations. (These are therefore strictly repeated measures, a fact we ignore in our analysis.)
Extra: these are the wagnerGrowth data from the robustbase package:
growth0 <- robustbase::wagnerGrowth
growth0
In the original data, Period is actually a factor (categorical), which would impact the analysis (and I don’t know whether I will have reached “regression with categorical variables” in lecture by the time you need to do this). I figured that looking for a time trend is enough for us, so we can treat time period as quantitative:
growth0 %>% mutate(Period = as.numeric(Period)) -> growth0
growth0
and that is the data set I saved for you.
These data are, as I say, from Hannover in Germany, not to be confused with Hanover in New Hampshire, where the Ivy League school Dartmouth College is. They originally came from a paper with the catchy title of “Regionale Beschäftigungsdynamik und höherwertige Produktionsdienste: Ergebnisse für den Grossraum Hannover (1979-1992)”, which translates to “regional employment dynamics and higher-valued production services: results for Greater Hannover”, where the “greater”, like Greater Toronto, means the city and the surrounding area. It is not clear to me whether the 21 regions are all the regions in and around Hannover, or just a sample of them (which would affect the inferences you make: what are these regions at these times a sample of?).
- (2 points) Fit a regression predicting rate of employment growth from the other variables of interest (that is, including Period but not including Region). Display the results of the regression.
This:
growth.1 <- lm(y ~ PA + GPA + HS + GHS + Period, data = growth)
summary(growth.1)
Call:
lm(formula = y ~ PA + GPA + HS + GHS + Period, data = growth)
Residuals:
Min 1Q Median 3Q Max
-18.1086 -4.1808 0.7785 4.5571 16.7734
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 5.56994 8.20119 0.679 0.500
PA -0.31284 0.18875 -1.657 0.103
GPA 0.07293 0.43443 0.168 0.867
HS -0.65789 0.92037 -0.715 0.478
GHS 3.58862 1.45897 2.460 0.017 *
Period 6.20205 1.30090 4.768 1.33e-05 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 7.547 on 57 degrees of freedom
Multiple R-squared: 0.4725, Adjusted R-squared: 0.4262
F-statistic: 10.21 on 5 and 57 DF, p-value: 5.084e-07
If you prefer (and you have loaded broom), do the first line and then do these, in either order:
tidy(growth.1)
glance(growth.1)
The information in these two (respectively intercept and slopes, and things like R-squared and the overall \(F\)-test) is the same as the information in summary, so either way, summary or tidy plus glance, is good.
Marking note: if you make a mistake here, that may affect your answers the rest of the way. The marker will try to give you credit for later answers that would have been correct except for your error here, but the marker has only a limited amount of time and energy to expend on trying to decide whether you have made any additional errors, so this is not a guarantee.
Extra: the R-squared is not especially high (even though some of the explanatory variables are significant). This is typical of real-life data like this, where there are all sorts of effects that are hard to quantify (and there just is what economics people call “volatility”). There might also be differences among the regions, above and beyond any differences they have in the variables that were measured, but I didn’t want to have you get into that: Region can only reasonably be treated as categorical, and there are a lot of regions (you would get one effect for each Region except for the first one which is the baseline). So I had you lump all the regions together, thinking of a region at a certain time period as being one observation (hopefully) independent of the others.
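For the curious, a sketch of what that would look like: factor(Region) treats the region numbers as categories, giving 20 extra coefficients (one for each Region beyond the baseline). The name growth.1r is my choice:
growth.1r <- lm(y ~ PA + GPA + HS + GHS + Period + factor(Region), data = growth)
summary(growth.1r)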
- (2 points) What is the value of Estimate for Period? Interpret this number in the context of the data.
6.20 (rounded suitably). This says that if you are looking at one time period later, but all the other explanatory variables are the same (PA, GPA, HS, GHS do not change), the rate of employment growth is predicted to be higher by 6.20 percentage points.
Remember that when you interpret a slope in a multiple regression, this tells you that if this variable increases by one, and if all the other explanatory variables stay the same, then the response variable will increase by whatever the slope is. You need both “if”s; if we look at one time period later, and other variables change, then the slope of Period does not tell us what is going on (we have an actual calculation to do).
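To illustrate that “actual calculation”: if Period goes up by one and, say, GHS changes too, you predict both scenarios and subtract. A sketch, with values for the explanatory variables that I made up:
new <- tibble(PA = 30, GPA = 1, HS = 20, GHS = c(1, 2), Period = c(1, 2))
predict(growth.1, new)
The difference between the two predictions combines the Period slope and the GHS slope, rather than being 6.20 alone.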
Extra: the Latin term for this is “ceteris paribus”, which translates to “all else equal”. Wikipedia article. The problem in economics (as here) and, for that matter, other sciences, is that other things are usually not equal, and it can be very hard to disentangle what is going on. Economists have long used “ceteris paribus” as a way to understand market mechanisms, such as “other things equal, if the supply of beef decreases, then the price of beef increases”. If you use this principle, you have to be careful: in the UK in the mid-1990s, there was an outbreak of “mad cow disease” and the supply of beef (from uninfected cows) went way down, but the price did not go up as much as expected, because a lot of people stopped buying beef as a result of the disease, and would not have bought beef no matter how cheap it was.
Multiple regression is a great way to understand the individual effects of other variables ceteris paribus, because that is literally what regression slopes are. Whether an explanatory variable in a multiple regression is significant (as Period is here) tells you whether it has an effect on the response over and above any effects of the other explanatory variables: that is, whether it has anything to add (and should be kept) or not (and should be removed, but we come to that later). Sometimes you hear this expressed as “adjusting for the effect of” whatever the other variables are; the fact that this regression has GHS in it means that we are looking at the effect of, say, Period while allowing for the fact that GHS will change from one time period to another.
- (5 points) By making a suitable collection of plots, demonstrate that there are no substantial problems with this regression.
The suitable collection of plots is:
- residuals vs fitted values (1 point)
- normal quantile plot of residuals (1 point)
- residuals vs explanatory variables, all in one facetted plot (2 points).
(The last point is for a suitable commentary: we get to that later.)
Think ahead to the last one: this will need you to use augment (from broom) to get everything into one dataframe, so you would do well to deal with this first:
growth.1a <- augment(growth.1)
growth.1a
This has the original data plus the stuff from the regression, that is to say, everything you need for all the plots, so you can use this dataframe the rest of the way.
Residuals vs. fitted:
ggplot(growth.1a, aes(x = .fitted, y = .resid)) + geom_point()
Normal quantile plot of residuals:
ggplot(growth.1a, aes(sample = .resid)) + stat_qq() + stat_qq_line()
To plot residuals vs. each \(x\), the first thing to realize is that this plot is much easier to make if you have all the \(x\)-variable values in one column, with a separate column saying which \(x\)-variable the value is for. I actually did another step before that, which was to select only the variables that were going to feature in the plot:
growth.1a %>%
select(PA:Period, .resid) %>%
pivot_longer(PA:Period, names_to = "xname", values_to = "xval")
Now I can make the plot by plotting .resid against xval, facetting by xname, and remembering to give each facet its own scale, so this is the code you need, possibly without the second line:
growth.1a %>%
select(PA:Period, .resid) %>%
pivot_longer(PA:Period, names_to = "xname", values_to = "xval") %>%
ggplot(aes(x = xval, y = .resid)) + geom_point() +
facet_wrap(~ xname, scales = "free")
As for commentary: remember that you are trying to say how you know the regression is more or less satisfactory (that there are “no substantial problems” with it), so you need to say something along these lines:
- the residuals vs. fitted plot is a random scatter of points
- the residuals have close to a normal distribution (the points are mostly along the line), except possibly for the lower tail
- the residuals against the explanatory variables are random scatters of points (for the four explanatory variables), and against Period, the three (vertical) scatters of residuals are more or less equally distributed about zero. If you wanted to be critical, the only problem I can see on here is that the residuals for period 2 go slightly more positive than those for the other two periods.
- (2 points) Build a better model (that is, make a better choice of explanatory variables), showing your process.
This means to remove any explanatory variables that are not contributing to the model. Remember that the P-values of the slopes tell you what would happen if you remove just that one explanatory variable, so you have to remove the least significant one, see what happens, and try again.
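Aside: base R’s drop1 shows all of these “remove just this one” tests in a single call, which can save some scrolling; a sketch:
drop1(growth.1, test = "F")
For single-degree-of-freedom terms like these, the F-tests are equivalent to the t-tests in summary (each F statistic is the square of the corresponding t statistic). End of aside.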
With that in mind:
summary(growth.1)
Call:
lm(formula = y ~ PA + GPA + HS + GHS + Period, data = growth)
Residuals:
Min 1Q Median 3Q Max
-18.1086 -4.1808 0.7785 4.5571 16.7734
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 5.56994 8.20119 0.679 0.500
PA -0.31284 0.18875 -1.657 0.103
GPA 0.07293 0.43443 0.168 0.867
HS -0.65789 0.92037 -0.715 0.478
GHS 3.58862 1.45897 2.460 0.017 *
Period 6.20205 1.30090 4.768 1.33e-05 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 7.547 on 57 degrees of freedom
Multiple R-squared: 0.4725, Adjusted R-squared: 0.4262
F-statistic: 10.21 on 5 and 57 DF, p-value: 5.084e-07
GPA has the largest P-value (0.867), so remove it first, and refit. You can copy-paste your code from growth.1, change the model number, and remove what you want to remove. You can also use tidy instead of summary, although you might then have to deal with P-values in scientific notation:
growth.2 <- lm(y ~ PA + HS + GHS + Period, data = growth)
summary(growth.2)
Call:
lm(formula = y ~ PA + HS + GHS + Period, data = growth)
Residuals:
Min 1Q Median 3Q Max
-18.3954 -4.1475 0.8155 4.6189 16.6693
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 5.6339 8.1234 0.694 0.4907
PA -0.3212 0.1805 -1.779 0.0805 .
HS -0.6586 0.9126 -0.722 0.4734
GHS 3.5678 1.4415 2.475 0.0163 *
Period 6.2464 1.2631 4.945 6.85e-06 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 7.483 on 58 degrees of freedom
Multiple R-squared: 0.4722, Adjusted R-squared: 0.4358
F-statistic: 12.97 on 4 and 58 DF, p-value: 1.313e-07
In growth.2, HS has the largest P-value among the actual explanatory variables, so it comes out next.
Aside: if you use tidy, you get this:
tidy(growth.2)
The story is the same (remove HS), though how you get there (with the scientific notation) is different: look for the least negative power of 10 (\(10^{-1}\)), and if there is more than one explanatory variable with that power of 10, pick the one with the biggest multiplier of that power of 10. Here, there is only one, HS (the intercept is not eligible to be removed). Then check that you are not accidentally removing a significant explanatory variable: HS’s P-value is actually 0.47, so it is not significant and can be removed. In scientific notation, 0.05 is \(5 \times 10^{-2}\), or 5e-02.
The technical term for the power of 10 is the “exponent”, and for the multiplier of that power of 10, the “mantissa”; see Wikipedia, for example.
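If the scientific notation is a chore to read, one way (a sketch) to see the P-values written out in full is:
tidy(growth.2) %>%
  mutate(p.value = format(p.value, scientific = FALSE))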
End of aside.
Now remove HS and see whether PA should stay:
growth.3 <- lm(y ~ PA + GHS + Period, data = growth)
summary(growth.3)
Call:
lm(formula = y ~ PA + GHS + Period, data = growth)
Residuals:
Min 1Q Median 3Q Max
-18.660 -4.324 0.839 5.014 17.224
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.4337 6.7786 0.359 0.7209
PA -0.2659 0.1628 -1.633 0.1078
GHS 3.6146 1.4341 2.520 0.0144 *
Period 6.0412 1.2257 4.929 7.06e-06 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 7.453 on 59 degrees of freedom
Multiple R-squared: 0.4675, Adjusted R-squared: 0.4404
F-statistic: 17.27 on 3 and 59 DF, p-value: 3.65e-08
PA’s P-value actually went up, so it comes out too:
growth.4 <- lm(y ~ GHS + Period, data = growth)
summary(growth.4)
Call:
lm(formula = y ~ GHS + Period, data = growth)
Residuals:
Min 1Q Median 3Q Max
-16.1858 -4.2081 0.5336 5.2638 18.5170
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -7.859 2.529 -3.107 0.00289 **
GHS 4.127 1.419 2.909 0.00508 **
Period 6.617 1.190 5.560 6.59e-07 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 7.556 on 60 degrees of freedom
Multiple R-squared: 0.4434, Adjusted R-squared: 0.4249
F-statistic: 23.9 on 2 and 60 DF, p-value: 2.322e-08
and now everything is significant and we are done.
If you are using tidy, this is what you see at this point:
tidy(growth.4)
The largest P-value is the one with e-03 on the end (the least negative power of 10), namely GHS. \(5.1 \times 10^{-3} = 0.0051\), which is smaller than 0.05, so we cannot remove GHS. Or: the P-value of GHS is smaller than 0.05 because 5.08e-03 is smaller than 5e-02 (it has a more negative power of 10).
The rate of employment growth depends on growth in people employed in higher services, and the time period (and not on how many people are employed in production activities or the growth in that). Both Estimates are positive, so a higher value of either of these two variables (for Period, a later time period) indicates a higher rate of employment growth.
Extra: this last regression is the one you would actually use to predict rate of employment growth, because it only contains variables that really do affect employment growth (over and above whatever else remained in the model). Before you do that, however, you would probably look again at residual plots. I didn’t ask you to do this, partly for a technical reason (which I will explain) and partly because it repeats what you did before. My suspicion is that taking out those explanatory variables will not have introduced any additional problems, and the residual plots you got before were basically satisfactory.
In any case, the starting point is likely to be augment again, using what you now realize to be the best model, but:
augment(growth.4)
The problem now is that when you use augment, it only uses the variables from the original dataframe that you kept in the model (in this case, only GHS and Period). This is normally what you want, and why augment does it this way, but when we draw residual plots, we want to plot the residuals against all the explanatory variables in the original dataframe, in case we have missed something about them (such as a curved relationship with the response that didn’t make it into our regression). What you do in this case is to “augment the regression with the data”, like this:
augment(growth.4, data = growth)
which makes sure that you get the regression stuff plus all the original data. Then you can go on to plot residuals against fitted values, make a normal quantile plot of the residuals, and plot the residuals against each of the original explanatory variables. Augmenting the regression object with the data means that there is now no problem with doing this.
I suppose I now have to demonstrate:
growth.4a <- augment(growth.4, data = growth)
ggplot(growth.4a, aes(x = .fitted, y = .resid)) + geom_point()
ggplot(growth.4a, aes(sample = .resid)) + stat_qq() + stat_qq_line()
growth.4a %>%
select(PA:GHS, Period, .resid) %>%
pivot_longer(-.resid, names_to = "xname", values_to = "xvalue") %>%
ggplot(aes(x = xvalue, y = .resid)) + geom_point() +
facet_wrap(~ xname, scales = "free")
and I think I was right: I don’t think we have introduced any problems that weren’t there before.
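One more Extra: having settled on growth.4 and checked its residuals, a prediction from it would go as below, with values for GHS and Period that I made up. Asking for interval = "prediction" gets prediction intervals for individual regions rather than just point predictions:
new <- tibble(GHS = c(1, 2), Period = 3)
predict(growth.4, new, interval = "prediction")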
- (1 point) Are both employment types represented in the final model, or not? Explain briefly.
The two employment types (from the data description) are Production Activities (PA and GPA) and Higher Services (HS and GHS). Our final model includes only GHS along with Period, so only Higher Services are represented in the final model.
That is to say, knowing about Production Activities, or the growth in them, does not help us predict the overall rate of employment growth (at least, not in Hannover, Germany, in these years or years economically like them).
Points: \((1+2+3+2+2+2+2) + (1+2+2+5+2+1) = 14+13 = 27\).