---
title: "Regression with categorical variables"
editor:
  markdown:
    wrap: 72
---
## Packages for this section
```{r with-categ-R-1}
library(tidyverse)
library(broom)
```
## The pigs revisited
```{r with-categ-R-2, echo=FALSE}
options(dplyr.summarise.inform = FALSE)
```
- Recall pig feed data, after we tidied it:
```{r with-categ-R-3, message=FALSE}
my_url <- "http://ritsokiguess.site/datafiles/pigs2.txt"
# values in the file are separated by single spaces
pigs <- read_delim(my_url, delim = " ")
pigs
```
## Summaries
```{r with-categ-R-4}
pigs %>%
  group_by(feed) %>%
  summarize(n = n(), mean_wt = mean(weight),
            sd_wt = sd(weight))
```
## Running through `aov` and `lm`
- What happens if we run this through `lm` rather than `aov`?
- Recall `aov` first:
```{r with-categ-R-5}
pigs.1 <- aov(weight ~ feed, data = pigs)
summary(pigs.1)
```
## and now `lm`
\footnotesize
```{r with-categ-R-6}
pigs.2 <- lm(weight ~ feed, data = pigs)
summary(pigs.2)
tidy(pigs.2)
glance(pigs.2)
```
\normalsize
## Understanding those slopes {.scrollable}
- Get one slope for each category of categorical variable `feed`,
except for the first.
- `feed1` treated as "baseline", others measured relative to that.
- Thus prediction for feed 1 is intercept, 60.62 (mean weight for feed
1).
- Prediction for feed 2 is 60.62 + 8.68 = 69.30 (mean weight for feed
2).
- Or, mean weight for feed 2 is 8.68 bigger than for feed 1.
- Mean weight for feed 3 is 33.48 bigger than for feed 1.
- A slope would be negative if the mean for that feed had been smaller
than for feed 1.
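
A minimal sketch of the same idea on made-up numbers (not the pigs data; the names `toy` and `toy.1` are invented for illustration): the intercept is the mean of the baseline group, and each slope is that group's mean minus the baseline mean.

```r
# Made-up data, for illustration only: three groups of three observations
toy <- data.frame(
  group = factor(rep(c("a", "b", "c"), each = 3)),
  y = c(10, 12, 14, 20, 22, 24, 7, 9, 11)
)

# Group means: a = 12, b = 22, c = 9
group_means <- tapply(toy$y, toy$group, mean)
group_means

# Regression on the categorical variable; "a" is the baseline
# because it is the first level of the factor
toy.1 <- lm(y ~ group, data = toy)
coef(toy.1)

# Each group's mean minus the baseline mean, to compare with the slopes
group_means - group_means["a"]
```

Here the slope for `groupc` comes out negative, because group `c` has a smaller mean than the baseline group `a`.
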
## Reproducing the ANOVA
- Pass the fitted model object into `anova`:
\footnotesize
```{r with-categ-R-7}
anova(pigs.2)
```
\normalsize
- Same F-test as `aov` gave before.
- But no Tukey this way, because `TukeyHSD` wants a model fitted with
`aov`:
\footnotesize
```{r with-categ-R-8, error=TRUE}
TukeyHSD(pigs.2)
```
\normalsize
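
One possible workaround, sketched here on made-up data rather than the pigs (`toy`, `toy.lm`, `toy.aov` and `toy.tukey` are invented names), is to refit the same model formula with `aov` and give that to `TukeyHSD`:

```r
# Made-up data, for illustration only: three groups of four observations
toy <- data.frame(
  group = factor(rep(c("a", "b", "c"), each = 4)),
  y = c(10, 12, 14, 13, 20, 22, 24, 21, 7, 9, 11, 10)
)

# The regression version of the model
toy.lm <- lm(y ~ group, data = toy)

# Refit the same model formula with aov, then ask for Tukey
toy.aov <- aov(formula(toy.lm), data = toy)
toy.tukey <- TukeyHSD(toy.aov)
toy.tukey
```

Each row of the output compares one pair of groups, with the difference in means, a confidence interval and an adjusted P-value.
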
## The crickets
- Male crickets rub their wings together to produce a chirping sound.
- Rate of chirping, called "pulse rate", depends on species and
possibly on temperature.
- Pulse rates measured for a sample of crickets of two species;
temperature also recorded.
- Does pulse rate differ for species, especially when temperature
accounted for?
## The crickets data
Read the data:
```{r with-categ-R-9, message=FALSE}
my_url <- "http://ritsokiguess.site/datafiles/crickets2.csv"
crickets <- read_csv(my_url)
# display a random sample of 10 rows
crickets %>% slice_sample(n = 10)
```
## Fit model with `lm`
```{r with-categ-R-10}
crickets.1 <- lm(pulse_rate ~ temperature + species,
                 data = crickets)
```
Can I remove anything? No:
```{r with-categ-R-11}
drop1(crickets.1, test = "F")
```
`drop1` is the right thing to use in a regression with categorical
(explanatory) variables in it. It answers the question "can I remove
this categorical variable *as a whole*?"
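
To see why this matters, here is a hedged sketch on simulated data (all names and numbers are made up) with a three-level categorical variable: the `summary` coefficient table has a separate row for each non-baseline level, while `drop1` gives a single test for the variable as a whole.

```r
set.seed(1)  # simulated data, for illustration only
n <- 30
toy <- data.frame(
  x = runif(n, 0, 10),
  group = factor(rep(c("a", "b", "c"), times = n / 3))
)

# Response depends on x and on group, plus random noise
shift <- c(a = 0, b = 2, c = -1)
toy$y <- 5 + 1.5 * toy$x + unname(shift[as.character(toy$group)]) + rnorm(n)

toy.1 <- lm(y ~ x + group, data = toy)

# Coefficient table: separate rows for groupb and groupc
rownames(coef(summary(toy.1)))

# drop1: one row per explanatory variable; group is tested on 2 df
toy.drop <- drop1(toy.1, test = "F")
toy.drop
```

With only two categories (as with the two cricket species) there is just one slope for the categorical variable, so the two approaches test the same thing; the distinction matters once there are three or more categories.
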
## The summary
```{r with-categ-R-12}
summary(crickets.1)
```
## Conclusions
- Slope for temperature says that increasing temperature by 1 degree
increases pulse rate by about 3.6 on average (same for both
species).
- Slope for `speciesniveus` says that pulse rate for `niveus` about 10
lower than that for `exclamationis` at same temperature (latter
species is baseline).
- R-squared of almost 0.99 is very high, so that the prediction of
pulse rate from species and temperature is very good.
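
The "same for both species" and "at same temperature" readings come from the model having no interaction term, so the fitted lines are parallel. A small sketch on made-up data (not the crickets; `toy`, `toy.1`, `new` and `gap` are invented names) checks that the predicted gap between two groups equals the group slope, whatever the value of `x`:

```r
# Made-up data, for illustration only
toy <- data.frame(
  x = c(1, 2, 3, 4, 5, 1, 2, 3, 4, 5),
  group = factor(rep(c("a", "b"), each = 5)),
  y = c(3.1, 4.9, 7.2, 9.0, 10.8, 1.0, 3.2, 4.9, 7.1, 9.0)
)

# Additive model: common slope for x, plus a shift for group b
toy.1 <- lm(y ~ x + group, data = toy)

# Predict for both groups at two different values of x
new <- expand.grid(x = c(2, 4), group = c("a", "b"))
new$pred <- predict(toy.1, new)
new

# Gap between groups (b minus a) at each x, and the slope for groupb
gap <- new$pred[new$group == "b"] - new$pred[new$group == "a"]
gap
coef(toy.1)["groupb"]
```

The two gaps are identical and match the `groupb` slope, which is what "at the same temperature" means in the conclusions above.
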
## To end with a graph
- Two quantitative variables and one categorical: scatterplot with
categories distinguished by colour.
- This graph seems to need a title and subtitle, which I define first.
```{r with-categ-R-13}
t1 <- "Pulse rate against temperature for two species of crickets"
t2 <- "Temperature in degrees Celsius"
ggplot(crickets, aes(x = temperature, y = pulse_rate,
                     colour = species)) +
  geom_point() + geom_smooth(method = "lm", se = FALSE) +
  ggtitle(t1, t2) -> g
```
## The graph
```{r with-categ-R-14}
g
```