---
title: "Bootstrap for sampling distribution of sample mean"
---
## Assessing assumptions
- Our $t$-tests assume normality of the variable being tested
- but the Central Limit Theorem says that normality matters less if the
sample is "large"
- in practice "approximate normality" is enough, but how do we assess
whether what we have is normal enough?
- so far: draw a histogram or boxplot and make a call, allowing for
sample size.
## What actually has to be normal
- is: the **sampling distribution of the sample mean**
- the distribution of sample mean over *all possible samples*
- but we only have *one* sample!
- Idea: assume our sample is representative of the population, and
draw samples from our sample (!), with replacement.
- This gives an idea of what different samples from the population
might look like.
- Called the *bootstrap*, after the expression "to pull yourself up by
your own bootstraps".
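- The idea in miniature (a sketch with a made-up vector `x`, not real data): each resample-with-replacement comes out a bit different, and so does its mean:

```{r}
x <- c(10, 20, 30, 40, 50) # a toy "sample"
set.seed(1)
sample(x, replace = TRUE) # one bootstrap sample: some values repeat
mean(sample(x, replace = TRUE)) # another one; its mean will differ a bit
```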
## Packages
```{r}
library(tidyverse)
```
## Blue Jays attendances
```{r bootstrap-R-1, echo=FALSE, message=FALSE}
jays <- read_csv("http://ritsokiguess.site/datafiles/jays15-home.csv")
set.seed(457299)
```
```{r bootstrap-R-2}
jays$attendance
```
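- For reference (an extra plot, not in the original slides): a histogram of the raw attendances shows the skewness referred to later, which is what makes us worry about the $t$-test in the first place:

```{r}
ggplot(jays, aes(x = attendance)) + geom_histogram(bins = 10)
```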
- A bootstrap sample:
```{r bootstrap-R-3}
s <- sample(jays$attendance, replace = TRUE)
s
```
- It is easier to see what is happening if we sort both the actual
attendances and the bootstrap sample:
```{r}
sort(jays$attendance)
sort(s)
```
## Getting mean of bootstrap sample
- A bootstrap sample is the same size as the original, but it contains
repeated values (e.g. 15062) and omits others (e.g. 42917).
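- To check this (a small extra, using the bootstrap sample `s` from above): `duplicated` flags the repeated draws, and `setdiff` finds the attendances that did not get picked at all:

```{r}
sum(duplicated(s)) # number of repeated draws in the bootstrap sample
setdiff(jays$attendance, s) # attendances that s missed
```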
- We need the mean of our bootstrap sample:
```{r bootstrap-R-4}
mean(s)
```
- This is a little different from the mean of our actual sample:
```{r bootstrap-R-5}
mean(jays$attendance)
```
- Want a sense of how the sample mean might vary, if we were able to
take repeated samples from our population.
- Idea: take lots of *bootstrap* samples, and see how *their* sample
means vary.
## Setting up bootstrap sampling
- Begin by setting up a dataframe that contains a row for each
bootstrap sample. I usually call this column `sim`. Do just 4 to get
the idea:
```{r bootstrap-R-6}
tibble(sim = 1:4)
```
## Drawing the bootstrap samples
- Then set up to work one row at a time, and draw a bootstrap sample
of the attendances in each row:
```{r bootstrap-R-7}
tibble(sim = 1:4) %>%
rowwise() %>%
mutate(sample = list(sample(jays$attendance, replace = TRUE)))
```
- Each row of our dataframe contains *all* of a bootstrap sample of 25
observations drawn with replacement from the attendances.
## Sample means
- Find the mean of each sample:
```{r bootstrap-R-8}
tibble(sim = 1:4) %>%
rowwise() %>%
mutate(sample = list(sample(jays$attendance, replace = TRUE))) %>%
mutate(my_mean = mean(sample))
```
- These are (four simulated values of) the bootstrapped sampling
distribution of the sample mean.
## Make a histogram of them
- rather pointless here, but to get the idea:
```{r bootstrap-R-9}
tibble(sim = 1:4) %>%
rowwise() %>%
mutate(sample = list(sample(jays$attendance, replace = TRUE))) %>%
mutate(my_mean = mean(sample)) %>%
ggplot(aes(x = my_mean)) + geom_histogram(bins = 3) -> g
```
## The (pointless) histogram
```{r bootstrap-R-10}
g
```
## Now do again with a decent number of bootstrap samples
- say 1000, and put a decent number of bins on the histogram also:
```{r bootstrap-R-11}
tibble(sim = 1:1000) %>%
rowwise() %>%
mutate(sample = list(sample(jays$attendance, replace = TRUE))) %>%
mutate(my_mean = mean(sample)) %>%
ggplot(aes(x = my_mean)) + geom_histogram(bins = 10) -> g
```
## The (better) histogram
```{r bootstrap-R-12}
g
```
## Comments
- This is very close to normal
- The bootstrap says that the sampling distribution of the sample mean
is close to normal, even though the distribution of the data is not
- A sample size of 25 is big enough to overcome the skewness that we
saw
- This is the Central Limit Theorem in practice
- It is surprisingly powerful.
- Thus, the $t$-test is actually perfectly good here.
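- An extra check (not in the original slides): the SD of the bootstrap sample means should be close to the usual standard error of the mean, $s/\sqrt{n}$. Since `mean` returns a single number, no `list()` is needed if we only keep the means:

```{r}
tibble(sim = 1:1000) %>%
rowwise() %>%
mutate(my_mean = mean(sample(jays$attendance, replace = TRUE))) %>%
pull(my_mean) -> boot_means
sd(boot_means) # bootstrap estimate of the standard error
sd(jays$attendance) / sqrt(length(jays$attendance)) # textbook standard error
```

- The two numbers should agree fairly closely, which is another way of seeing that the bootstrap is doing a sensible job.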
## Comments on the code 1/2
- You might have been wondering about this:
```{r bootstrap-R-13}
tibble(sim = 1:4) %>%
rowwise() %>%
mutate(sample = list(sample(jays$attendance, replace = TRUE)))
```
## Comments on the code 2/2
- how did we squeeze all 25 sample values into one cell?
- `sample` is a so-called "list-column" that can contain anything.
- why did we have to put `list()` around the `sample()`?
- because `sample` produces a collection of numbers, not just a
single one
- the `list()` signals this: "make a list-column of samples".
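- A related sketch (with made-up numbers): to get the values back *out* of a list-column, `unnest` from `tidyr` (loaded with the tidyverse) turns each list cell into one row per value:

```{r}
tibble(sim = 1:2) %>%
rowwise() %>%
mutate(sample = list(sample(1:5, replace = TRUE))) %>%
ungroup() %>%
unnest(sample)
```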
## Two samples
- Assumption: *both* samples are from a normal distribution.
- In this case, each sample should be "normal enough" given its sample
size, since the Central Limit Theorem will help.
- Use bootstrap on each group independently, as above.
## Kids learning to read
```{r bootstrap-R-14, echo=FALSE, message=FALSE}
my_url <- "http://ritsokiguess.site/datafiles/drp.txt"
kids <- read_delim(my_url, " ")
kids
```
```{r bootstrap-R-15}
ggplot(kids, aes(x = group, y = score)) + geom_boxplot()
```
## Getting just the control group
- Use `filter` to select rows where something is true:
```{r bootstrap-R-16}
kids %>% filter(group == "c") -> controls
controls
```
## Bootstrap these
```{r bootstrap-R-17}
tibble(sim = 1:1000) %>%
rowwise() %>%
mutate(sample = list(sample(controls$score, replace = TRUE))) %>%
mutate(my_mean = mean(sample)) %>%
ggplot(aes(x = my_mean)) + geom_histogram(bins = 10)
```
## ... and the treatment group:
```{r bootstrap-R-19}
kids %>% filter(group == "t") -> treats
tibble(sim = 1:1000) %>%
rowwise() %>%
mutate(sample = list(sample(treats$score, replace = TRUE))) %>%
mutate(my_mean = mean(sample)) %>%
ggplot(aes(x = my_mean)) + geom_histogram(bins = 15)
```
## Comments
- the sampling distributions of the sample means both look pretty
normal, though the treatment group's is a tiny bit left-skewed
- as we thought, there are no problems with our two-sample $t$ at all.