---
title: "Regression revisited"
editor:
markdown:
wrap: 72
---
## Regression
- Use regression when one variable is an outcome (*response*, $y$).
- See if/how response depends on other variable(s), *explanatory*,
$x_1, x_2,\ldots$.
- Can have *one* or *more than one* explanatory variable, but always
one response.
- Assumes a *straight-line* relationship between response and
explanatory.
- Ask:
- *is there* a relationship between $y$ and $x$'s, and if so,
which ones?
- what does the relationship look like?
## Packages
```{r bRegression-1, results='hide'}
library(MASS) # for Box-Cox, later
library(tidyverse)
library(broom)
library(marginaleffects)
library(conflicted)
conflict_prefer("select", "dplyr")
```
## A regression with one $x$
13 children, measure average total sleep time (ATST, mins) and age
(years) for each. See if ATST depends on age. Data in `sleep.txt`, ATST
then age. Read in data:
```{r bRegression-2 }
my_url <- "http://ritsokiguess.site/datafiles/sleep.txt"
sleep <- read_delim(my_url, " ")
```
## Check data
```{r bRegression-3, size="footnotesize"}
summary(sleep)
```
```{r}
sleep
```
Make scatter plot of ATST (response) vs. age (explanatory) using code
overleaf:
## The scatterplot
```{r suggo, fig.height=6}
ggplot(sleep, aes(x = age, y = atst)) + geom_point()
```
## Correlation
- Measures how well a straight line fits the data:
```{r bRegression-4 }
with(sleep, cor(atst, age))
```
- $1$ is perfect upward trend, $-1$ is perfect downward trend, 0 is no
trend.
- This one close to perfect downward trend.
- Can do correlations of all pairs of variables:
```{r bRegression-5 }
cor(sleep)
```
## Lowess curve
- Sometimes nice to guide the eye: is the trend straight, or not?
- Idea: *lowess curve*. "Locally weighted least squares", not affected
by outliers, not constrained to be linear.
- Lowess is a *guide*: even if straight line appropriate, may
wiggle/bend a little. Looking for *serious* problems with linearity.
- Add lowess curve to plot using `geom_smooth`:
## Plot with lowess curve
```{r icko,fig.height=6}
ggplot(sleep, aes(x = age, y = atst)) + geom_point() +
geom_smooth()
```
## The regression
Scatterplot shows no obvious curve, and a pretty clear downward trend.
So we can run the regression:
```{r bRegression-6, echo=-1}
options(width = 60)
sleep.1 <- lm(atst ~ age, data = sleep)
```
## The output
```{r bRegression-7}
summary(sleep.1)
```
## Conclusions
- The relationship appears to be a straight line, with a downward
trend.
- $F$-tests for model as a whole and $t$-test for slope (same) both
confirm this (P-value $5.7\times 10^{-7}=0.00000057$).
- Slope is $-14$, so a 1-year increase in age goes with a 14-minute
decrease in ATST on average.
- R-squared is correlation squared (when one $x$ anyway), between 0
and 1 (1 good, 0 bad).
- Here R-squared is 0.9054, pleasantly high.
## Doing things with the regression output
- Output from regression (and eg. $t$-test) is all right to look at,
but hard to extract and re-use information from.
- Package `broom` extracts info from model output in way that can be
used in pipe (later):
```{r bRegression-8, size="footnotesize"}
tidy(sleep.1)
```
## also one-line summary of model:
```{r bRegression-9}
glance(sleep.1)
```
## Broom part 2
```{r bRegression-10,warning=F}
sleep.1 %>% augment(sleep)
```
Useful for plotting residuals against an $x$-variable.
## CI for mean response and prediction intervals
Once useful regression exists, use it for prediction:
- To get a single number for prediction at a given $x$, substitute
into regression equation, eg. age 10: predicted ATST is
$646.48-14.04(10)=506$ minutes.
- To express uncertainty of this prediction:
- *CI for mean response* expresses uncertainty about mean ATST for all
children aged 10, based on data.
- *Prediction interval* expresses uncertainty about predicted ATST for
a new child aged 10 whose ATST not known. More uncertain.
- Also do above for a child aged 5.
## The `marginaleffects` package 1/2
To get predictions for specific values, set up a dataframe with those
values first:
```{r}
new <- datagrid(model = sleep.1, age = c(10, 5))
new
```
Any variables in the dataframe that you don't specify are set to their
mean values (quantitative) or most common category (categorical).
## The `marginaleffects` package 2/2
Then feed into `newdata` in `predictions`. This contains a lot of
columns, so you probably want only to display the ones you care about:
```{r}
cbind(predictions(sleep.1, newdata = new)) %>%
select(estimate, conf.low, conf.high, age)
```
The confidence limits are a 95% confidence interval for the mean
response at that `age`.
## Prediction intervals
These are obtained (instead) with `predict` as below. Use the same
dataframe `new` as before:
```{r bRegression-12 }
pp <- predict(sleep.1, new, interval = "p")
pp
cbind(new, pp) %>% select(-atst)
```
## Plotting the confidence intervals for mean response again:
```{r}
plot_predictions(sleep.1, condition = "age")
```
## Comments
- Age 10 closer to centre of data, so intervals are both narrower than
those for age 5.
- Prediction intervals bigger than CI for mean (additional
uncertainty).
- Technical note: output from `predict` is R `matrix`, not data frame,
so Tidyverse `bind_cols` does not work. Use base R `cbind`.
## That grey envelope
Marks confidence interval for mean for all $x$:
```{r bRegression-15, fig.height=6.0}
ggplot(sleep, aes(x = age, y = atst)) + geom_point() +
geom_smooth(method = "lm") +
scale_y_continuous(breaks = seq(420, 600, 20))
```
## Diagnostics
How to tell whether a straight-line regression is appropriate?
- Before: check scatterplot for straight trend.
- After: plot *residuals* (observed minus predicted response) against
predicted values. Aim: a plot with no pattern.
## Residual plot
Not much pattern here --- regression appropriate.
```{r akjhkadjfhjahnkkk,fig.height=3.4}
ggplot(sleep.1, aes(x = .fitted, y = .resid)) + geom_point()
```
## An inappropriate regression
Different data:
```{r curvy}
my_url <- "http://ritsokiguess.site/datafiles/curvy.txt"
curvy <- read_delim(my_url, " ")
```
## Scatterplot
```{r bRegression-16, fig.height=3.8}
ggplot(curvy, aes(x = xx, y = yy)) + geom_point()
```
## Regression line, anyway
\scriptsize
```{r bRegression-17}
curvy.1 <- lm(yy ~ xx, data = curvy)
summary(curvy.1)
```
## Residual plot
```{r altoadige,fig.height=3.8}
ggplot(curvy.1, aes(x = .fitted, y = .resid)) + geom_point()
```
## No good: fixing it up
- Residual plot has *curve*: middle residuals positive, high and low
ones negative. Bad.
- Fitting a curve would be better. Try this:
```{r bRegression-18 }
curvy.2 <- lm(yy ~ xx + I(xx^2), data = curvy)
```
- Adding `xx`-squared term, to allow for curve.
- Another way to do same thing: specify how model *changes*:
```{r bRegression-19 }
curvy.2a <- update(curvy.1, . ~ . + I(xx^2))
```
## Regression 2
```{r bRegression-20 }
tidy(curvy.2)
glance(curvy.2) #
```
## Comments
- `xx`-squared term definitely significant (P-value 0.000182), so need
this curve to describe relationship.
- Adding squared term has made R-squared go up from 0.5848 to 0.9502:
great improvement.
- This is a definite curve!
## The residual plot now
No problems any more:
```{r bRegression-21, size="small", fig.height=3.4}
ggplot(curvy.2, aes(x = .fitted, y = .resid)) + geom_point()
```
## Another way to handle curves
- Above, saw that changing $x$ (adding $x^2$) was a way of handling
curved relationships.
- Another way: change $y$ (transformation).
- Can guess how to change $y$, or might be theory:
- example: relationship $y=ae^{bx}$ (exponential growth):
- take logs to get $\ln y=\ln a + bx$.
- Taking logs has made relationship linear ($\ln y$ as response).
- Or, *estimate* transformation, using Box-Cox method.
## Box-Cox
- Install package `MASS` via `install.packages("MASS")` (only need to
do *once*)
- Every R session you want to use something in `MASS`, type
`library(MASS)`
## Some made-up data
```{r bRegression-22, message=F}
my_url <- "http://ritsokiguess.site/datafiles/madeup2.csv"
madeup <- read_csv(my_url)
madeup
```
Seems to be faster-than-linear growth, maybe exponential growth.
## Scatterplot: faster than linear growth
```{r dsljhsdjlhf,fig.height=3.2}
ggplot(madeup, aes(x = x, y = y)) + geom_point() +
geom_smooth()
```
## Running Box-Cox
- `library(MASS)` first.
- Feed `boxcox` a model formula with a squiggle in it, such as you
would use for `lm`.
- Output: a graph (next page):
```{r bRegression-23, eval=F}
boxcox(y ~ x, data = madeup)
```
## The Box-Cox output
```{r trento,echo=F, fig.height=4}
boxcox(y ~ x, data = madeup)
```
## Comments
- $\lambda$ (lambda) is the power by which you should transform $y$ to
get the relationship straight (straighter). Power 0 is "take logs"
- Middle dotted line marks best single value of $\lambda$ (here about
0.1).
- Outer dotted lines mark 95% CI for $\lambda$, here $-0.3$ to 0.7,
approx. (Rather uncertain about best transformation.)
- Any power transformation within the CI supported by data. In this
case, log ($\lambda=0$) and square root ($\lambda=0.5$) good, but no
transformation ($\lambda=1$) not.
- Pick a "round-number" value of $\lambda$ like $2,1,0.5,0,-0.5,-1$.
Here 0 and 0.5 good values to pick.
## Did transformation straighten things?
- Plot transformed $y$ against $x$. Here, log:
```{r bRegression-24, message=FALSE, fig.height=3.2 }
ggplot(madeup, aes(x = x, y = log(y))) + geom_point() +
geom_smooth()
```
Looks much straighter.
## Regression with transformed $y$
```{r bRegression-25}
madeup.1 <- lm(log(y) ~ x, data = madeup)
glance(madeup.1)
tidy(madeup.1)
```
R-squared now decently high.
## Multiple regression
- What if more than one $x$? Extra issues:
- Now one intercept and a slope for each $x$: how to interpret?
- Which $x$-variables actually help to predict $y$?
- Different interpretations of "global" $F$-test and individual
$t$-tests.
- R-squared no longer correlation squared, but still interpreted
as "higher better".
- In `lm` line, add extra $x$s after `~`.
- Interpretation not so easy (and other problems that can occur).
## Multiple regression example
Study of women and visits to health professionals, and how the number of
visits might be related to other variables:
```{=tex}
\begin{description}
\item[timedrs:] number of visits to health professionals (over course of study)
\item[phyheal:] number of physical health problems
\item[menheal:] number of mental health problems
\item[stress:] result of questionnaire about number and type of life changes
\end{description}
```
`timedrs` response, others explanatory.
## The data
```{r bRegression-26 }
my_url <-
"http://ritsokiguess.site/datafiles/regressx.txt"
visits <- read_delim(my_url, " ")
```
## Check data
```{r bRegression-27}
visits
```
## Fit multiple regression
```{r bRegression-28}
visits.1 <- lm(timedrs ~ phyheal + menheal + stress,
data = visits)
summary(visits.1)
```
## The slopes
- Model as a whole strongly significant even though R-sq not very big
(lots of data). At least one of the $x$'s predicts `timedrs`.
- The physical health and stress variables definitely help to predict
the number of visits, but *with those in the model* we don't need
`menheal`. However, look at prediction of `timedrs` from `menheal`
by itself:
## Just `menheal`
```{r bRegression-30 }
visits.2 <- lm(timedrs ~ menheal, data = visits)
summary(visits.2)
```
## `menheal` by itself
- `menheal` by itself *does* significantly help to predict `timedrs`.
- But the R-sq is much less (6.5% vs. 22%).
- So other two variables do a better job of prediction.
- With those variables in the regression (`phyheal` and `stress`),
don't need `menheal` *as well*.
## Investigating via correlation
Leave out first column (`subjno`):
```{r bRegression-31 }
visits %>% select(-subjno) %>% cor()
```
- `phyheal` most strongly correlated with `timedrs`.
- Not much to choose between other two.
- But `menheal` has higher correlation with `phyheal`, so not as much
to *add* to prediction as `stress`.
- Goes to show things more complicated in multiple regression.
## Residual plot (from `timedrs` on all)
```{r iffy8,fig.height=3.5}
ggplot(visits.1, aes(x = .fitted, y = .resid)) + geom_point()
```
Apparently random. But...
## Normal quantile plot of residuals
```{r bRegression-32, fig.height=3.5}
ggplot(visits.1, aes(sample = .resid)) + stat_qq() + stat_qq_line()
```
Not normal at all; upper tail is way too long.
## Absolute residuals
Is there trend in *size* of residuals (fan-out)? Plot *absolute value*
of residual against fitted value:
```{r bRegression-33, fig.height=2.8, message=FALSE}
ggplot(visits.1, aes(x = .fitted, y = abs(.resid))) +
geom_point() + geom_smooth()
```
## Comments
- On the normal quantile plot:
- highest (most positive) residuals are *way* too high
- distribution of residuals skewed to right (not normal at all)
- On plot of absolute residuals:
- size of residuals getting bigger as fitted values increase
- predictions getting more variable as fitted values increase
- that is, predictions getting *less accurate* as fitted values
increase, but predictions should be equally accurate all way
along.
- Both indicate problems with regression, of kind that transformation
of response often fixes: that is, predict *function* of response
`timedrs` instead of `timedrs` itself.
## Box-Cox transformations
- Taking log of `timedrs` and having it work: lucky guess. How to find
good transformation?
- Box-Cox again.
- Extra problem: some of `timedrs` values are 0, but Box-Cox expects
all +. Note response for `boxcox`:
```{r bRegression-35, eval=F}
boxcox(timedrs + 1 ~ phyheal + menheal + stress, data = visits)
```
## Try 1
```{r bRegression-36, echo=F,fig.height=4.5}
boxcox(timedrs + 1 ~ phyheal + menheal + stress,
data = visits)
```
## Comments on try 1
- Best: $\lambda$ just less than zero.
- Hard to see scale.
- Focus on $\lambda$ in $(-0.3,0.1)$:
```{r bRegression-37, size="footnotesize"}
my.lambda <- seq(-0.3, 0.1, 0.01)
my.lambda
```
## Try 2
```{r bRegression-38, fig.height=3.5}
boxcox(timedrs + 1 ~ phyheal + menheal + stress,
lambda = my.lambda,
data = visits
)
```
## Comments
- Best: $\lambda$ just about $-0.07$.
- CI for $\lambda$ about $(-0.14,0.01)$.
- Only nearby round number: $\lambda=0$, log transformation.
## Fixing the problems
- Try regression again, with transformed response instead of original
one.
- Then check residual plot to see that it is OK now.
```{r bRegression-39 }
visits.3 <- lm(log(timedrs + 1) ~ phyheal + menheal + stress,
data = visits
)
```
- `timedrs+1` because some `timedrs` values 0, can't take log of 0.
- Won't usually need to worry about this, but when response could be
zero/negative, fix that before transformation.
## Output
\scriptsize
```{r bRegression-40 }
summary(visits.3)
```
## Comments
- Model as a whole strongly significant again
- R-sq higher than before (37% vs. 22%) suggesting things more linear
now
- Same conclusion re `menheal`: can take out of regression.
- Should look at residual plots (next pages). Have we fixed problems?
## Residuals against fitted values
```{r bRegression-41, fig.height=3.5}
ggplot(visits.3, aes(x = .fitted, y = .resid)) +
geom_point()
```
## Normal quantile plot of residuals
```{r bRegression-42, fig.height=3.8}
ggplot(visits.3, aes(sample = .resid)) + stat_qq() + stat_qq_line()
```
## Absolute residuals against fitted
```{r bRegression-43, fig.height=3.5, message=F}
ggplot(visits.3, aes(x = .fitted, y = abs(.resid))) +
geom_point() + geom_smooth()
```
## Comments
- Residuals vs. fitted looks a lot more random.
- Normal quantile plot looks a lot more normal (though still a little
right-skewness)
- Absolute residuals: not so much trend (though still some).
- Not perfect, but much improved.
## Testing more than one $x$ at once
- The $t$-tests test only whether one variable could be taken out of
the regression you're looking at.
- To test significance of more than one variable at once, fit model
with and without variables
- then use `anova` to compare fit of models:
```{r bRegression-44 }
visits.5 <- lm(log(timedrs + 1) ~ phyheal + menheal + stress,
data = visits)
visits.6 <- lm(log(timedrs + 1) ~ stress, data = visits)
```
## Results of tests
```{r bRegression-45}
anova(visits.6, visits.5)
```
- Models don't fit equally well, so bigger one fits better.
- Or "taking both variables out makes the fit worse, so don't do it".
- Taking out those $x$'s is a mistake. Or putting them in is a good
idea.
## The punting data
Data set `punting.txt` contains 4 variables for 13 right-footed football
kickers (punters): left leg and right leg strength (lbs), distance
punted (ft), another variable called "fred". Predict punting distance
from other variables:
\scriptsize
```
left right punt fred
170 170 162.50 171
130 140 144.0 136
170 180 174.50 174
160 160 163.50 161
150 170 192.0 159
150 150 171.75 151
180 170 162.0 174
110 110 104.83 111
110 120 105.67 114
120 130 117.58 126
140 120 140.25 129
130 140 150.17 136
150 160 165.17 154
```
## Reading in
- Separated by *multiple spaces* with *columns lined up*:
```{r bRegression-46 }
my_url <- "http://ritsokiguess.site/datafiles/punting.txt"
punting <- read_table(my_url)
```
## The data
```{r bRegression-47}
punting
```
## Regression and output
```{r bRegression-48}
punting.1 <- lm(punt ~ left + right + fred, data = punting)
glance(punting.1)
tidy(punting.1)
summary(punting.1)
```
## Comments
- Overall regression strongly significant, R-sq high.
- None of the $x$'s significant! Why?
- $t$-tests only say that you could take any one of the $x$'s out
without damaging the fit; doesn't matter which one.
- Explanation: look at *correlations*.
## The correlations
```{r bRegression-49 }
cor(punting)
```
- *All* correlations are high: $x$'s with `punt` (good) and with each
other (bad, at least confusing).
- What to do? Probably do just as well to pick one variable, say
`right` since kickers are right-footed.
## Just `right`
```{r bRegression-50 }
punting.2 <- lm(punt ~ right, data = punting)
summary(punting.2)
anova(punting.2, punting.1)
punting.3 <- lm(punt ~ left, data = punting)
summary(punting.3)
```
No significant loss by dropping other two variables.
## Comparing R-squareds
```{r bRegression-51 }
summary(punting.1)$r.squared
summary(punting.2)$r.squared
```
Basically no difference. In regression (over), `right` significant:
## Regression results
```{r bRegression-52 }
tidy(punting.2)
```
## But\ldots
- Maybe we got the *form* of the relationship with `left` wrong.
- Check: plot *residuals* from previous regression (without `left`)
against `left`.
- Residuals here are "punting distance adjusted for right leg
strength".
- If there is some kind of relationship with `left`, we should include
in model.
- Plot of residuals against original variable: `augment` from `broom`.
## Augmenting `punting.2`
```{r bRegression-53 }
punting.2 %>% augment(punting) -> punting.2.aug
punting.2.aug
```
## Residuals against `left`
```{r basingstoke,fig.height=3.5}
ggplot(punting.2.aug, aes(x = left, y = .resid)) +
geom_point()
```
## Comments
- There is a *curved* relationship with `left`.
- We should add `left`-squared to the regression (and therefore put
`left` back in when we do that):
```{r bRegression-54 }
punting.3 <- lm(punt ~ left + I(left^2) + right,
data = punting
)
```
## Regression with `left-squared`
\scriptsize
```{r bRegression-55}
summary(punting.3)
```
## Comments
- This was definitely a good idea (R-squared has clearly increased).
- We would never have seen it without plotting residuals from
`punting.2` (without `left`) against `left`.
- Negative slope for `leftsq` means that increased left-leg strength
only increases punting distance up to a point: beyond that, it
decreases again.