
title: "Case study: asphalt"
editor:
markdown:
wrap: 72

## The asphalt data
 31 asphalt pavements prepared under different conditions. How does
quality of pavement depend on these?
 Variables:
 `pct.a.surf` Percentage of asphalt in surface layer
 `pct.a.base` Percentage of asphalt in base layer
 `fines` Percentage of fines in surface layer
 `voids` Percentage of voids in surface layer
 `rut.depth` Change in rut depth per million vehicle passes
 `viscosity` Viscosity of asphalt
 `run` 2 data collection periods: 1 for run 1, 0 for run 2.
 `rut.depth` response. Depends on other variables, how?
## Packages for this section
```{r asphalt1}
library(MASS)
library(tidyverse)
library(broom)
library(leaps)
```
Make sure to load `MASS` before `tidyverse` (for annoying technical
reasons).
## Getting set up
```{r asphalt2}
my_url < "http://ritsokiguess.site/datafiles/asphalt.txt"
asphalt < read_delim(my_url, " ")
```
 Quantitative variables with one response: multiple regression.
 Some issues here that don't come up in "simple" regression; handle
as we go. (STAB27/STAC67 ideas.)
## The data (some)
```{r asphalt3}
asphalt
```
## Plotting response "rut depth" against everything else
Same idea as for plotting separate predictions on one plot:
```{r asphalt4}
asphalt %>%
pivot_longer(
rut.depth,
names_to="xname", values_to="x"
) %>%
ggplot(aes(x = x, y = rut.depth)) + geom_point() +
facet_wrap(~xname, scales = "free") > g
```
"collect all the xvariables together into one column called x, with
another column xname saying which x they were, then plot these x's
against rut.depth, a separate facet for each xvariable."
I saved this graph to plot later (on the next page).
## The plot
```{r asphalt5}
g
```
## Interpreting the plots
 One plot of rut depth against each of the six other variables.
 Get rough idea of what's going on.
 Trends mostly weak.
 `viscosity` has strong but nonlinear trend.
 `run` has effect but variability bigger when run is 1.
 Weak but downward trend for `voids`.
 Nonlinearity of `rut.depth``viscosity` relationship should concern
us.
## Log of `viscosity`: more nearly linear?
 Take this back to asphalt engineer: suggests log of `viscosity`:
```{r logvisplot, fig.keep="none", warning=F, message=F}
ggplot(asphalt, aes(y = rut.depth, x = log(viscosity))) +
geom_point() + geom_smooth(se = F) > g
```
(plot overleaf)
## Rut depth against logviscosity
```{r asphalt6, ref.label="logvisplot", echo=FALSE, message=FALSE}
g
```
## Comments and next steps
 Not very linear, but better than before.
 In multiple regression, hard to guess which x's affect response. So
typically start by predicting from everything else.
 Model formula has response on left, squiggle, explanatories on right
joined by plusses:
```{r asphalt7}
rut.1 < lm(rut.depth ~ pct.a.surf + pct.a.base + fines +
voids + log(viscosity) + run, data = asphalt)
summary(rut.1)
```
## Regression output: `summary(rut.1)` or:
\footnotesize
```{r asphalt8}
glance(rut.1)
tidy(rut.1)
```
\normalsize
## Comments
 Rsquared 81%, not so bad.
 Pvalue in `glance` asserts that something helping to predict
rut.depth.
 Table of coefficients says `log(viscosity)`.
 But confused by clearly nonsignificant variables: remove those to
get clearer picture of what is helpful.

## Before we do anything, look at residual plots:
```
(a) of residuals against fitted values (as usual)
```

(b) of residuals against each explanatory.
 Problem fixes:
 with (a): fix response variable;
 with some plots in (b): fix those explanatory variables.
## Plot fitted values against residuals
```{r asphalt9}
ggplot(rut.1, aes(x = .fitted, y = .resid)) + geom_point()
```
## Normal quantile plot of residuals
```{r}
ggplot(rut.1, aes(sample = .resid)) + stat_qq() + stat_qq_line()
```
## Plotting residuals against $x$ variables
 Problem here is that residuals are in the fitted model, and the
observed $x$values are in the original data frame `asphalt`.
 Package broom contains a function `augment` that combines these two
together so that they can later be plotted: start with a model
first, and then augment with a data frame:
```{r asphalt10}
rut.1 %>% augment(asphalt) > rut.1a
```
## What does rut.1a contain?
```{r asphalt11, echo=FALSE}
#options(width = 70)
```
```{r asphalt12}
names(rut.1a)
```
 all the stuff in original data frame, plus:
 quantities from regression (starting with a dot)
## Plotting residuals against $x$variables
```{r asphalt13}
rut.1a %>%
mutate(log_vis=log(viscosity)) %>%
pivot_longer(
c(pct.a.surf:voids, run, log_vis),
names_to="xname", values_to="x"
) %>%
ggplot(aes(x = x, y = .resid)) +
geom_point() + facet_wrap(~xname, scales = "free") > g
```
## The plot
```{r asphalt14}
g
```
## Comments
 There is serious curve in plot of residuals vs. fitted values.
Suggests a transformation of $y$.
 The residualsvs$x$'s plots don't show any serious trends. Worst
probably that potential curve against logviscosity.
 Also, large positive residual, 10, that shows up on all plots.
Perhaps transformation of $y$ will help with this too.
 If residualfitted plot OK, but some residual$x$ plots not, try
transforming those $x$'s, eg. by adding $x^2$ to help with curve.
## Which transformation?
 Best way: consult with person who brought you the data.
 Can't do that here!
 No idea what transformation would be good.
 Let data choose: "BoxCox transformation".
 Scale is that of "ladder of powers": power transformation, but 0 is
log.
## Running BoxCox
From package `MASS`:
```{r asphalt15}
boxcox(rut.depth ~ pct.a.surf + pct.a.base + fines + voids +
log(viscosity) + run, data = asphalt)
```
## Comments on BoxCox plot
 $\lambda$ represents power to transform $y$ with.
 Best single choice of transformation parameter $\lambda$ is peak of
curve, close to 0.
 Vertical dotted lines give CI for $\lambda$, about (−0.05, 0.2).
 $\lambda = 0$ means "log".
 Narrowness of confidence interval mean that these not supported by
data:
 No transformation ($\lambda = 1$)
 Square root ($\lambda = 0.5$)
 Reciprocal ($\lambda = −1$).
## Relationships with explanatories
 As before: plot response (now `log(rut.depth)`) against other
explanatory variables, all in one shot:
```{r asphalt16}
asphalt %>%
mutate(log_vis=log(viscosity)) %>%
pivot_longer(
c(pct.a.surf:voids, run, log_vis),
names_to="xname", values_to="x"
) %>%
ggplot(aes(y = log(rut.depth), x = x)) + geom_point() +
facet_wrap(~xname, scales = "free") > g3
```
## The new plots
```{r asphalt17}
g3
```
## Modelling with transformed response
 These trends look pretty straight, especially with `log.viscosity`.
 Values of `log.rut.depth` for each `run` have same spread.
 Other trends weak, but are straight if they exist.
 Start modelling from the beginning again.
 Model `log.rut.depth` in terms of everything else, see what can be
removed:
```{r asphalt18}
rut.2 < lm(log(rut.depth) ~ pct.a.surf + pct.a.base +
fines + voids + log(viscosity) + run, data = asphalt)
```
 use `tidy` from `broom` to display just the coefficients.
## Output
```{r asphalt19}
tidy(rut.2)
summary(rut.2)
```
## Taking out everything nonsignificant
 Try: remove everything but pct.a.surf and log.viscosity:
\footnotesize
```{r asphalt20}
rut.3 < lm(log(rut.depth) ~ pct.a.surf + log(viscosity), data = asphalt)
summary(rut.3)
```
\normalsize
\footnotesize
 Check that removing all those variables wasn't too much:
```{r asphalt21}
anova(rut.3, rut.2)
```
\normalsize
 $H_0$ : two models equally good; $H_a$ : bigger model better.
 Null not rejected here; small model as good as the big one, so
prefer simpler smaller model `rut.3`.
## Find the largest Pvalue by eye:
```{r asphalt22}
tidy(rut.2)
```
 Largest Pvalue is 0.78 for `pct.a.base`, not significant.
 So remove this first, refit and reassess.
 Or, as over.
## Get the computer to find the largest Pvalue for you
 Output from `tidy` is itself a data frame, thus:
```{r asphalt23}
tidy(rut.2) %>% arrange(p.value)
```
 Largest Pvalue at the bottom.
## Take out `pct.a.base`
 Copy and paste the `lm` code and remove what you're removing:
\small
```{r asphalt24}
rut.4 < lm(log(rut.depth) ~ pct.a.surf + fines + voids +
log(viscosity) + run, data = asphalt)
tidy(rut.4) %>% arrange(p.value) %>% dplyr::select(term, p.value)
```
\normalsize
 `fines` is next to go, Pvalue 0.32.
## "Update"
Another way to do the same thing:
```{r asphalt25}
rut.4 < update(rut.2, . ~ .  pct.a.base)
tidy(rut.4) %>% arrange(p.value)
```
 Again, `fines` is the one to go. (Output identical as it should be.)
## Take out fines:
```{r asphalt26}
rut.5 < update(rut.4, . ~ .  fines)
tidy(rut.5) %>% arrange(p.value) %>% dplyr::select(term, p.value)
```
Can't take out intercept, so `run`, with Pvalue 0.36, goes next.
## Take out run:
```{r asphalt27}
rut.6 < update(rut.5, . ~ .  run)
tidy(rut.6) %>% arrange(p.value) %>% dplyr::select(term, p.value)
```
Again, can't take out intercept, so largest Pvalue is for `voids`,
0.044. But this is significant, so we shouldn't remove `voids`.
## Comments
 Here we stop: `pct.a.surf`, `voids` and `log.viscosity` would all
make fit significantly worse if removed. So they stay.
 Different final result from taking things out one at a time (top),
than by taking out 4 at once (bottom):
```{r asphalt28}
summary(rut.6)
coef(rut.6)
coef(rut.3)
```
 Point: Can make difference which way we go.
## Comments on variable selection
 Best way to decide which $x$'s belong: expert knowledge: which of
them should be important.
 Best automatic method: what we did, "backward selection".
 Do not learn about "stepwise regression"! [**eg.
here**](https://towardsdatascience.com/stoppingstepwisewhystepwiseselectionisbadandwhatyoushoulduseinstead90818b3f52df)
 R has function `step` that does backward selection, like this:
```{r asphalt29, eval=F}
step(rut.2, direction = "backward", test = "F")
```
Gets same answer as we did (by removing least significant x).
 Removing nonsignificant $x$'s may remove interesting ones whose
Pvalues happened not to reach 0.05. Consider using less stringent
cutoff like 0.20 or even bigger.
 Can also fit all possible regressions, as over (may need to do
`install.packages("leaps")` first).
## All possible regressions (output over)
Uses package `leaps`:
```{r asphalt30}
leaps < regsubsets(log(rut.depth) ~ pct.a.surf +
pct.a.base + fines + voids +
log(viscosity) + run,
data = asphalt, nbest = 2)
s < summary(leaps)
with(s, data.frame(rsq, outmat)) > d
```
## The output
```{r asphalt31, echo=F}
wid=getOption("width")
options(width=80)
```
\scriptsize
```{r asphalt32}
d %>% rownames_to_column("model") %>% arrange(desc(rsq))
```
\normalsize
```{r asphalt33, echo=F}
options(width=wid)
```
## Comments
 Problem: even adding a worthless x increases Rsquared. So try for
line where Rsquared stops increasing "too much", eg. top line (just
log.viscosity), first 3variable line (backwardselimination model).
Hard to judge.
 One solution (STAC67): adjusted Rsquared, where adding worthless
variable makes it go down.
 `data.frame` rather than `tibble` because there are several columns
in `outmat`.
## All possible regressions, adjusted Rsquared
```{r asphalt34, echo=F}
wid=getOption("width")
options(width=80)
```
\scriptsize
```{r asphalt35}
with(s, data.frame(adjr2, outmat)) %>%
rownames_to_column("model") %>%
arrange(desc(adjr2))
```
\normalsize
```{r asphalt36, echo=F}
options(width=wid)
```
## Revisiting the best model
 Best model was our rut.6:
```{r asphalt37}
tidy(rut.6)
```
## Revisiting (2)
 Regression slopes say that rut depth increases as logviscosity
decreases, `pct.a.surf` increases and `voids` increases. This more
or less checks out with out scatterplots against `log.viscosity`.
 We should check residual plots again, though previous scatterplots
say it's unlikely that there will be a problem:
```{r asphalt38}
g < ggplot(rut.6, aes(y = .resid, x = .fitted)) +
geom_point()
```
## Residuals against fitted values
```{r asphalt39}
g
```
```{r}
ggplot(rut.6, aes(sample = .resid)) + stat_qq() + stat_qq_line()
```
## Plotting residuals against x's
 Do our trick again to put them all on one plot:
```{r asphalt40}
augment(rut.6, asphalt) %>%
mutate(log_vis=log(viscosity)) %>%
pivot_longer(
c(pct.a.surf:voids, run, log_vis),
names_to="xname", values_to="x",
) %>%
ggplot(aes(y = .resid, x = x)) + geom_point() +
facet_wrap(~xname, scales = "free") > g2
```
## Residuals against the x's
```{r asphalt41}
g2
```
## Comments
 None of the plots show any sort of pattern. The points all look
random on each plot.
 On the plot of fitted values (and on the one of log.viscosity), the
points seem to form a "left half" and a "right half" with a gap in
the middle. This is not a concern.
 One of the pct.a.surf values is low outlier (4), shows up top left
of that plot.
 Only two possible values of run; the points in each group look
randomly scattered around 0, with equal spreads.
 Residuals seem to go above zero further than below, suggesting a
mild nonnormality, but not enough to be a problem.
## Variableselection strategies
 Expert knowledge.
 Backward elimination.
 All possible regressions.
 Taking a variety of models to experts and asking their opinion.
 Use a looser cutoff to eliminate variables in backward elimination
(eg. only if Pvalue greater than 0.20).
 If goal is prediction, eliminating worthless variables less
important.
 If goal is understanding, want to eliminate worthless variables
where possible.
 Results of variable selection not always reproducible, so caution
advised.