STAC32 Assignment 8

Packages

library(tidyverse)
library(broom)

Rocket science

The data in http://ritsokiguess.site/datafiles/supersonic.csv are from a study of the relation between thrust efficiency (thrust_eff) of supersonic propelling rockets (response) and the half-divergence angle (angle) of the rocket nozzle (explanatory). I know nothing more about these except that this is literally rocket science! The aim is to find a suitable model for the relationship between these two variables.

  1. (1 point) Read in and display (all of) the data.

As ever:

my_url <- "http://ritsokiguess.site/datafiles/supersonic.csv"
supersonic <- read_csv(my_url)
Rows: 7 Columns: 2
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
dbl (2): angle, thrust_eff

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
supersonic

There are only seven observations.

Something like rockets would also be a good name for the dataframe.
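
For example, if you prefer that name (a matter of taste; just use whichever name you picked consistently afterwards):

rockets <- read_csv(my_url)
rockets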

  2. (2 points) Make a suitable graph of the two variables, adding a smooth trend.

This is a scatterplot with a smooth trend, using the response thrust_eff on the \(y\)-axis:

ggplot(supersonic, aes(x = angle, y = thrust_eff)) + geom_point() + geom_smooth()
`geom_smooth()` using method = 'loess' and formula = 'y ~ x'

Keep the grey envelope or remove it, whichever you think is clearer.
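
If you decide to remove the envelope, add se = FALSE to the geom_smooth, something like this:

ggplot(supersonic, aes(x = angle, y = thrust_eff)) + geom_point() +
  geom_smooth(se = FALSE)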

  3. (2 points) Why does it make sense to fit a model that includes angle-squared to these data?

The relationship appears to go up then down again, in the same sort of way that a parabola does.

“The relationship is non-linear” does not show enough understanding. A non-linear relationship could be like this, or it could be exponential growth or decay, or it could be increase towards an upper limit, or any number of other things. Each of those would need to be treated differently (which is why linear models are much easier to deal with than non-linear ones).

Extra: if you look again at the smooth trend, the thrust efficiency at angle 20 appears to be a bit lower than you would expect for a parabola. Another way to see this is that the smooth trend “opens downward” up to an angle of about 18, then it opens slightly upward up to an angle of about 22 (look carefully!), and then it opens downward again. A parabola only opens one way (downward if the coefficient of \(x^2\) is negative, upward if it is positive), so it might be possible to improve on the model-fitting. Of course, it is also possible that the thrust efficiency at 20 is low just by chance and a parabola is perfectly good. If you eyeball it, you might guess that a thrust efficiency of about 0.97 at angle 20 is about what you would expect from a parabola, and that is within the grey envelope around the smooth trend (it goes almost up to 0.975). So chance certainly might be a possible explanation for what we saw.

We come back to this later.

  4. (2 points) Fit a model that includes a squared term in angle, and display the output.

supersonic.1 <- lm(thrust_eff ~ angle + I(angle^2), data = supersonic)
summary(supersonic.1)

Call:
lm(formula = thrust_eff ~ angle + I(angle^2), data = supersonic)

Residuals:
         1          2          3          4          5          6          7 
-0.0050238  0.0060714  0.0057143 -0.0050952 -0.0043571  0.0009286  0.0017619 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)  9.826e-01  9.340e-03 105.206 4.89e-08 ***
angle        2.245e-03  1.070e-03   2.097  0.10395    
I(angle^2)  -1.510e-04  2.616e-05  -5.771  0.00447 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.005993 on 4 degrees of freedom
Multiple R-squared:  0.9874,    Adjusted R-squared:  0.9811 
F-statistic: 156.8 on 2 and 4 DF,  p-value: 0.0001585

Don’t forget to include an angle term as well, to get the most general parabola (as we did with the windmill data in lecture).

This actually fits extremely well, to judge by R-squared, and the squared term is definitely significant. (The term in angle is actually not significantly different from zero, but standard practice is to leave it in the model when a higher-order term is in there. When we add a cubed term later, the angle term becomes almost significant.)
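
Written out, with the estimates above rounded, the fitted parabola is \(\hat{y} = 0.9826 + 0.00224x - 0.000151x^2\), where \(y\) is thrust_eff and \(x\) is angle.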

  5. (2 points) To your scatterplot (copy your code from before), add the fitted values from your regression, joined by lines.

This is similar to, but simpler than, the plot from lecture with the fitted line and curve on it (for the windmill data). The key is to realize that the fitted values come from a different place than the rest of the data: they live in the fitted model object supersonic.1. Your geom_line therefore needs a data = and an aes of its own, to say where its values come from and that the \(y\)-coordinates of what you are plotting are called .fitted (the \(x\)-coordinates are still angle, so you don’t need to mention those):

ggplot(supersonic, aes(x = angle, y = thrust_eff)) + geom_point() +
  geom_line(data = supersonic.1, aes(y = .fitted))

The points for this question are for the geom_line part, since you made the scatterplot in the previous part.

Extra: another way you might approach this is to make a dataframe with everything you need, and then plot that, like this:

supersonic %>% 
  mutate(fits = fitted(supersonic.1)) %>% 
  ggplot(aes(x = angle, y = thrust_eff)) + 
    geom_point() +
    geom_line(aes(y = fits))

Yet another way you might go is to use augment from the broom package to make the dataframe with the data and fitted values in it:

supersonic.1 %>% augment(supersonic)

and then plot that:

supersonic.1 %>% augment(supersonic) %>% 
  ggplot(aes(x = angle, y = thrust_eff)) + 
    geom_point() +
    geom_line(aes(y = .fitted))

Find a way to get to that plot via something we’ve seen in lecture.

  6. (2 points) Does there seem to be a systematic way in which the data deviate from the fitted parabola curve? Explain briefly.

I would say that the parabola is too high for the first, fourth, and fifth points (reading from the left), and too low for the second and third points. In other words, it doesn’t go up fast enough, and it doesn’t come down fast enough either. The suggestion is that we might have the wrong kind of curve.
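
One way to check this (not asked for) is to look at the signs of the residuals from the parabola model: a negative residual means the curve is above that data point, and a positive one means it is below:

supersonic.1 %>% augment(supersonic) %>% 
  select(angle, thrust_eff, .resid)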

  7. (2 points) The rocket scientists who did the study suspect that the relationship actually bends twice, while a parabola only bends once. Add a cubed term to your regression, and display the results.

Copy, paste, and add:

supersonic.2 <- lm(thrust_eff ~ angle +  I(angle^2) + I(angle^3), data = supersonic)
summary(supersonic.2)

Call:
lm(formula = thrust_eff ~ angle + I(angle^2) + I(angle^3), data = supersonic)

Residuals:
         1          2          3          4          5          6          7 
-0.0013571  0.0024048  0.0020476 -0.0050952 -0.0006905  0.0045952 -0.0019048 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)  9.606e-01  1.330e-02  72.201 5.86e-06 ***
angle        7.256e-03  2.685e-03   2.703   0.0736 .  
I(angle^2)  -4.443e-04  1.510e-04  -2.942   0.0604 .  
I(angle^3)   4.889e-06  2.494e-06   1.960   0.1449    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.004583 on 3 degrees of freedom
Multiple R-squared:  0.9945,    Adjusted R-squared:  0.989 
F-statistic: 180.1 on 3 and 3 DF,  p-value: 0.0006954

  8. (2 points) Out of your two models, which one do you prefer? Explain briefly.

The best answer is to look at the P-value of the cubed term, 0.1449. This is not less than 0.05, so the cubed term is not significant: it does not significantly improve the fit of the model, and we should prefer the first model, the one that only goes as far as angle^2. (This is Occam’s Razor in disguise: we don’t need the cubed term, so we should get rid of it; it has nothing to add.)

This is a better answer than comparing R-squared, for several reasons:

  • when you add something to a regression, you will always increase R-squared, even if the thing you are adding is worthless.
  • if you want to compare R-squared values, you have to ask yourself “has R-squared increased substantially?”, and the increase from 0.9874 to 0.9945 isn’t very big (they are both almost perfect fits).
  • in your second course, you might have learned about “adjusted R-squared”, for which the “adjusted” part is “adjusting for the number of explanatory variables in the model”. That you can compare between your two models: 0.9811 and 0.989. There is almost no difference between these, so no reason to prefer the cubic model over the quadratic one. Invoke Occam’s Razor again to say that you should only prefer the cubic model if it is noticeably better than the quadratic one, which it is not.
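
If you have met the F-test for comparing nested models, anova will make the same comparison (a sketch only; not required here). Because only one term was added, its P-value will match the 0.1449 for the cubed term:

anova(supersonic.1, supersonic.2)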

Extra: if you look at the scatterplot again, it does appear that it might bend twice, but the problem is that we only have 7 observations, and so it will be very difficult to prove that the cubic fits better than the quadratic. If we had more data (say, replicate measurements at those angles, or measurements at more different angles), we might be able to make a decision that the relationship really needs to be modelled as a cubic, but with these data, we are not at that point.

We can add the cubic fit to the graph we drew before:

ggplot(supersonic, aes(x = angle, y = thrust_eff)) + geom_point() +
  geom_line(data = supersonic.1, aes(y = .fitted)) +
  geom_line(data = supersonic.2, aes(y = .fitted), colour = "red")

It does look as if the cubic (in red [1]) captures the shape of the relationship better than the quadratic (in black), but the difference is not significant because we only have seven observations. It might, though, be worth collecting more data.

Another approach you might take is to make a dataframe with everything you want in it:

supersonic %>% 
  mutate(quadratic = fitted(supersonic.1),
         cubic = fitted(supersonic.2)) -> d
d

and then, as for the windmill example in lecture (where we plotted the linear, parabola, and asymptote models on one graph), we realize that we need all the fitted values in one column, so we have to pivot-longer first [2]:

d %>% 
  pivot_longer(quadratic:cubic, names_to = "model",
               values_to = "fit") %>% 
  ggplot(aes(x = angle, y = thrust_eff)) + geom_point() +
  geom_line(aes(y = fit, colour = model))

The graph is the same, just coloured differently. There is no data = here, because the fitted values and data are now all in one dataframe.

Setting concrete

Concrete is made by mixing sand, gravel, cement, and water. When mixed, it is liquid and can be poured into place, but over time, it hardens into something strong that can support weight. During the hardening process, the concrete gives off heat, and it is suspected that the amount of heat given off depends on the formulation of the cement used, in particular on the percentage by weight of four different chemicals, labelled A, B, C, and D below:

  • A: the percentage by weight of tricalcium aluminate
  • B: the percentage by weight of tricalcium silicate
  • C: the percentage by weight of tetracalcium alumino ferrite
  • D: the percentage by weight of dicalcium silicate
  • Heat: the heat given in calories per gram of cement (response)

The data are in http://ritsokiguess.site/datafiles/GLMsData_setting.csv.

  1. (1 point) Read in and display (some of) the data.

Surprisingly enough (joking):

my_url <- 'http://ritsokiguess.site/datafiles/GLMsData_setting.csv'
setting <- read_csv(my_url)  
Rows: 13 Columns: 5
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
dbl (5): A, B, C, D, Heat

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
setting

There are 13 observations.

Extra 1: presumably a standard amount of sand, gravel, and water was used each time, so that the only thing different between the concrete samples was the composition of the cement.

Extra 2: If you’ve done physics, you might be used to joules as a unit of heat (or energy or work). This is the SI unit; the data above are American, so an old-fashioned unit is used: one calorie is the amount of heat needed to raise the temperature of one millilitre (about one gram) of water by one degree Celsius. Food calories are 1000 times these calories (replace one millilitre by one litre), and can be thought of as the amount of energy obtained by metabolizing food. A joule is the amount of work done by a force of one newton acting over a distance of one metre, and about 4 joules make one (small) calorie.

  2. (2 points) Fit a regression predicting heat given off, from the four percentages by weight of the different chemicals, and display the results.

setting.1 <- lm(Heat ~ A + B + C + D, data = setting)
summary(setting.1)

Call:
lm(formula = Heat ~ A + B + C + D, data = setting)

Residuals:
    Min      1Q  Median      3Q     Max 
-3.1750 -1.6709  0.2508  1.3783  3.9254 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)  
(Intercept)  62.4054    70.0710   0.891   0.3991  
A             1.5511     0.7448   2.083   0.0708 .
B             0.5102     0.7238   0.705   0.5009  
C             0.1019     0.7547   0.135   0.8959  
D            -0.1441     0.7091  -0.203   0.8441  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.446 on 8 degrees of freedom
Multiple R-squared:  0.9824,    Adjusted R-squared:  0.9736 
F-statistic: 111.5 on 4 and 8 DF,  p-value: 4.756e-07

Give your model a number, because there will be more models later.

  3. (2 points) Something surprising has happened. What, and why? (Hint: look at R-squared and the P-values for the slopes.)

The R-squared is very high (close to a perfect fit), and to go with this you would expect to see at least one strongly significant explanatory variable, but none of them are significant!

Extra: The cause of this is “multicollinearity”. This is a problem that comes up in multiple regression: having explanatory variables correlated with the response is a good thing, but having explanatory variables correlated with each other is very confusing, because it makes it hard to see which of them deserve the credit for predicting the response.

How do I know these explanatory variables are correlated? One way is to work out the correlations:

cor(setting)
              A          B          C          D       Heat
A     1.0000000  0.2285795 -0.8241338 -0.2454451  0.7307175
B     0.2285795  1.0000000 -0.1392424 -0.9729550  0.8162526
C    -0.8241338 -0.1392424  1.0000000  0.0295370 -0.5346707
D    -0.2454451 -0.9729550  0.0295370  1.0000000 -0.8213050
Heat  0.7307175  0.8162526 -0.5346707 -0.8213050  1.0000000

B and D have a very strong negative correlation, and the correlation between A and C is almost as strong. The reason for these strong correlations is this:

setting %>% mutate(total = A + B + C + D)

The four percentages add up to near 100 (that is to say, the cement is almost completely composed of these four chemicals in some proportions), so that once you know three of the values, you can pretty much work out the fourth one.
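
If you have met variance inflation factors, they quantify this multicollinearity directly. A sketch, using vif from the car package (not a package we have used, so treat this as strictly optional); values much bigger than about 5 or 10 indicate trouble:

library(car)   # not loaded at the top of this assignment; provides vif()
vif(setting.1)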

  4. (3 points) Carry out a backward elimination: that is, remove non-significant explanatory variables as needed.

This is where we were at:

setting.1 <- lm(Heat ~ A + B + C + D, data = setting)
summary(setting.1)

Call:
lm(formula = Heat ~ A + B + C + D, data = setting)

Residuals:
    Min      1Q  Median      3Q     Max 
-3.1750 -1.6709  0.2508  1.3783  3.9254 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)  
(Intercept)  62.4054    70.0710   0.891   0.3991  
A             1.5511     0.7448   2.083   0.0708 .
B             0.5102     0.7238   0.705   0.5009  
C             0.1019     0.7547   0.135   0.8959  
D            -0.1441     0.7091  -0.203   0.8441  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.446 on 8 degrees of freedom
Multiple R-squared:  0.9824,    Adjusted R-squared:  0.9736 
F-statistic: 111.5 on 4 and 8 DF,  p-value: 4.756e-07

The least significant explanatory variable is C, so remove that first:

setting.2 <- lm(Heat ~ A + B + D, data = setting)
summary(setting.2)

Call:
lm(formula = Heat ~ A + B + D, data = setting)

Residuals:
    Min      1Q  Median      3Q     Max 
-3.0919 -1.8016  0.2562  1.2818  3.8982 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  71.6483    14.1424   5.066 0.000675 ***
A             1.4519     0.1170  12.410 5.78e-07 ***
B             0.4161     0.1856   2.242 0.051687 .  
D            -0.2365     0.1733  -1.365 0.205395    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.309 on 9 degrees of freedom
Multiple R-squared:  0.9823,    Adjusted R-squared:  0.9764 
F-statistic: 166.8 on 3 and 9 DF,  p-value: 3.323e-08

A is now suddenly very significant [3].

The least significant remaining explanatory variable is D, so it comes out next (copy, paste, edit again):

setting.3 <- lm(Heat ~ A + B, data = setting)
summary(setting.3)

Call:
lm(formula = Heat ~ A + B, data = setting)

Residuals:
   Min     1Q Median     3Q    Max 
-2.893 -1.574 -1.302  1.363  4.048 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 52.57735    2.28617   23.00 5.46e-10 ***
A            1.46831    0.12130   12.11 2.69e-07 ***
B            0.66225    0.04585   14.44 5.03e-08 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.406 on 10 degrees of freedom
Multiple R-squared:  0.9787,    Adjusted R-squared:  0.9744 
F-statistic: 229.5 on 2 and 10 DF,  p-value: 4.407e-09

and now A and B are both strongly significant, so here we stop.
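
R can automate this sort of elimination with step (a sketch only; it is not what was asked for here, and step uses AIC rather than P-values, so it will not necessarily follow exactly the same path as the hand elimination above):

step(setting.1, direction = "backward")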

Feel free to use tidy from broom instead of summary for this question.
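
For example, for the final model:

tidy(setting.3)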

Extra: note that R-squared has barely changed among the three regressions, so that removing C and D has done effectively zero damage to the fit. But if you were to remove, say, A now, you would catastrophically destroy R-squared:

setting.4 <- lm(Heat ~ B, data = setting)
summary(setting.4)

Call:
lm(formula = Heat ~ B, data = setting)

Residuals:
    Min      1Q  Median      3Q     Max 
-10.752  -6.008  -1.684   3.794  21.387 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  57.4237     8.4906   6.763  3.1e-05 ***
B             0.7891     0.1684   4.686 0.000665 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 9.077 on 11 degrees of freedom
Multiple R-squared:  0.6663,    Adjusted R-squared:  0.6359 
F-statistic: 21.96 on 1 and 11 DF,  p-value: 0.0006648

  5. (2 points) For one of the variables remaining in your best model, interpret the number in its row of the Estimate column in your output.

Let me just grab the output again (you don’t need to):

summary(setting.3)

Call:
lm(formula = Heat ~ A + B, data = setting)

Residuals:
   Min     1Q Median     3Q    Max 
-2.893 -1.574 -1.302  1.363  4.048 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 52.57735    2.28617   23.00 5.46e-10 ***
A            1.46831    0.12130   12.11 2.69e-07 ***
B            0.66225    0.04585   14.44 5.03e-08 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.406 on 10 degrees of freedom
Multiple R-squared:  0.9787,    Adjusted R-squared:  0.9744 
F-statistic: 229.5 on 2 and 10 DF,  p-value: 4.407e-09

Pick one. It doesn’t matter which. For example, the Estimate (slope) for A is 1.47, which means that if the cement contains one more percent of A, with everything else equal, we would expect about 1.47 more calories per gram of heat to be given off, on average.

Extra: the “everything else equal” is important, because if you change the amount of A and change something else, it’s not obvious what will happen. The other question is whether increasing the amount of A leaving everything else the same really means anything at all, because we saw earlier that the four percentages add up to near 100, and hence if you have more A, you are bound to have less of the other things. That issue we leave aside.
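
To see the slope interpretation in action, here is a sketch with made-up values of A and B, where only A changes (by one percentage point); the difference between the two predictions is the slope of A, about 1.47:

new <- tibble(A = c(10, 11), B = c(50, 50))   # hypothetical compositions
predict(setting.3, new)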

  6. (2 points) Plot residuals against fitted values and obtain a normal quantile plot of the residuals, from your best model. Do you see any problems?

There is nothing new here:

ggplot(setting.3, aes(x = .fitted, y = .resid)) + geom_point()

There seems to be no pattern here, so no problems. (If you look for more than 5 seconds, you will undoubtedly see a pattern of points across the top and across the bottom, but there are no problems with that.)

Then, in the usual way:

ggplot(setting.3, aes(sample = .resid)) + stat_qq() + stat_qq_line()

I don’t see enough of a pattern here to be worried about, though there is a “gap” in the middle of the residuals: seven clearly negative ones, six clearly positive ones, and none close to zero. The kind of thing I am worried about here is outliers, long tails, or skewness, and we don’t have any of those.

  7. (3 points) Plot the residuals against each of the original explanatory variables, using your best model. For maximum credit, do this in one plot.

The idea is to put each explanatory variable in a facet. The first thing, though, is to put the residuals and the original data in the same dataframe. I like to use augment from the broom package for this:

setting.3 %>% augment(setting)

and then it’s the same old problem of wanting A through D all in one column to make the plots (scroll right to see what happened):

setting.3 %>% augment(setting) %>% 
  pivot_longer(A:D, names_to = "xname", values_to = "xval")

and then make the plot, plotting residual against what I called xval, facetted by xname:

setting.3 %>% augment(setting) %>% 
  pivot_longer(A:D, names_to = "xname", values_to = "xval") %>% 
  ggplot(aes(x = xval, y = .resid)) + geom_point() +
    facet_wrap(~ xname, scales = "free")

I think the scales = "free" is best, because the four explanatory variables are on (slightly) different scales.

You certainly don’t need to do it in stages like I did. It probably helps to run it in stages as you’re trying to figure out what to do, but my last code and output are all you need.

I don’t see any sort of pattern beyond randomness in any of these. Make this comment if you like, but I didn’t ask for it here because the assignment is already long enough.

If you didn’t figure this out, make a dataframe with the residuals and the original data in it (e.g. with augment), and make the plots one at a time, for example:

setting.3 %>% augment(setting) -> d
d

and then, for example:

ggplot(d, aes(x = A, y = .resid)) + geom_point()

and the corresponding thing for the others.
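
For instance, for B:

ggplot(d, aes(x = B, y = .resid)) + geom_point()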

Extra: my habit is to include all the explanatory variables in these plots, even the ones we removed from the regression. This is because there might have been something like a curved relationship with one of them, that didn’t get captured in the regression we fitted. Such a relationship would show up as a curved trend in the plot of residuals against that explanatory variable, and then we would know to go back and add a squared term in that explanatory variable. There are no problems of this kind here (all four of the plots look random), but it was good that we checked.

Footnotes

  1. To make the whole geom_line red, rather than colouring it according to some other variable, put the colour outside the aes() and give it a literal colour name in quotes, in this case the colour you’d like this geom_line to be.

  2. If the code below confuses you, run it up to the pivot-longer first, see what it does, then think about how you have now made it easier to draw the plot.

  3. This is because we removed the explanatory variable C that it was strongly correlated with. With that in mind, we know that B and D were strongly correlated as well, so removing D should make B a lot more significant.