---
title: "The bootstrap"
---
## Packages for this section
```{r bootstrap-1}
library(tidyverse)
library(bootstrap)
```
Source: [Hesterberg et al](https://www.researchgate.net/publication/265399426_Bootstrap_Methods_and_Permutation_Tests)
## Is my sampling distribution normal enough?
- Recall IRS data (used as a motivation for the sign test) :
```{r bootstrap-2, include=FALSE}
my_url <- "http://ritsokiguess.site/datafiles/irs.txt"
irs <- read_csv(my_url)
```
```{r bootstrap-3, fig.height=5}
ggplot(irs, aes(x=Time))+geom_histogram(bins=10)
```
- $t$ procedure for the mean would not be a good idea because the distribution is skewed.
## What *actually* matters
- It's not the distribution of the *data* that has to be approx normal (for a $t$ procedure).
- What matters is the *sampling distribution of the sample mean*.
- If the sample size is large enough, the sampling distribution will be normal enough even if the data distribution is not.
- This is why we had to consider the sample size as well as the shape.
- But how do we know whether this is the case or not? We only have *one* sample.
## The (nonparametric) bootstrap
- Typically, our sample will be reasonably representative of the population.
- Idea: pretend the sample *is* the population, and sample from it *with replacement*.
- Calculate test statistic, and repeat many times.
- This gives an idea of how our statistic might vary in repeated samples: that is, its sampling distribution.
- Called the **bootstrap distribution** of the test statistic.
- If the bootstrap distribution is approx normal, infer that the true sampling distribution also approx normal, therefore inference about the mean such as $t$ is good enough.
- If not, we should be more careful.
## Why it works
- We typically estimate population parameters by using the corresponding sample thing: eg. estimate population mean using sample mean.
- This called **plug-in principle**.
- The fraction of sample values less than a value $x$ called the **empirical distribution function** (as a function of $x$).
- By plug-in principle, the empirical distribution function is an estimate of the population CDF.
- In this sense, the sample *is* an estimate of the population, and so sampling from it is an estimate of sampling from the population.
## Bootstrapping the IRS data
- Sampling with replacement is done like this (the default sample size is as long as the original data):
```{r bootstrap-4}
boot <- sample(irs$Time, replace=T)
mean(boot)
```
- That's one bootstrapped mean. We need a whole bunch.
## A whole bunch
- Use the same idea as for simulating power:
```{r bootstrap-5, echo=F}
set.seed(457299)
```
\small
```{r bootstrap-6}
tibble(sim = 1:1000) %>%
rowwise() %>%
mutate(boot_sample = list(sample(irs$Time, replace = TRUE)))
```
\normalsize
## Get the mean of each of those
\small
```{r bootstrap-7}
tibble(sim = 1:1000) %>%
rowwise() %>%
mutate(boot_sample = list(sample(irs$Time, replace = TRUE))) %>%
mutate(my_mean = mean(boot_sample)) -> samples
samples
```
\normalsize
## Sampling distribution of sample mean
```{r bootstrap-8, fig.height=4}
ggplot(samples, aes(x=my_mean)) + geom_histogram(bins=10)
```
- Is that a slightly long right tail?
## Normal quantile plot
might be better than a histogram:
```{r bootstrap-9, fig.height=3.5}
ggplot(samples, aes(sample = my_mean)) +
stat_qq()+stat_qq_line()
```
- a very very slight right-skewness, but very close to normal.
## Confidence interval from the bootstrap distribution
There are two ways (at least):
- percentile bootstrap interval: take the 2.5 and 97.5 percentiles (to get the middle 95%). This is easy, but not always the best:
\small
```{r bootstrap-10}
(b_p=quantile(samples$my_mean, c(0.025, 0.975)))
```
\normalsize
- bootstrap $t$: use the SD of the bootstrapped sampling distribution as the SE of the estimator of the mean and make a $t$ interval:
\small
```{r bootstrap-11}
n <- length(irs$Time)
t_star <- qt(0.975, n-1)
b_t <- with(samples, mean(my_mean)+c(-1, 1)*t_star*sd(my_mean))
b_t
```
\normalsize
## Comparing
- get ordinary $t$ interval:
```{r bootstrap-12}
my_names=c("LCL", "UCL")
o_t <- t.test(irs$Time)$conf.int
```
- Compare the 2 bootstrap intervals with the ordinary $t$-interval:
```{r bootstrap-13}
tibble(limit=my_names, o_t, b_t, b_p)
```
- The bootstrap $t$ and the ordinary $t$ are very close
- The percentile bootstrap interval is noticeably shorter (common) and higher (skewness).
## Which to prefer?
- If the intervals agree, then they are all good.
- If they disagree, they are all bad!
- In that case, use BCA interval (over).
## Bias correction and acceleration
- this from
"An introduction to the bootstrap", by
Brad Efron and Robert J. Tibshirani.
- there is way of correcting the CI for skewness in the bootstrap distribution, called the BCa method
- complicated (see the Efron and Tibshirani book), but implemented in `bootstrap` package.
## Run this on the IRS data:
```{r bootstrap-14}
bca=bcanon(irs$Time, 1000, mean)
bca$confpoints
```
## use 2.5% and 97.5% points for CI
```{r bootstrap-15}
bca$confpoints %>% as_tibble() %>%
filter(alpha %in% c(0.025, 0.975)) %>%
pull(`bca point`) -> b_bca
b_bca
```
## Comparing
```{r bootstrap-16}
tibble(limit=my_names, o_t, b_t, b_p, b_bca)
```
- The BCA interval says that the mean should be estimated even higher than the bootstrap percentile interval does.
- The BCA interval is the one to trust.
## Bootstrapping the correlation
Recall the soap data:
```{r bootstrap-17, message=FALSE}
url <- "http://ritsokiguess.site/datafiles/soap.txt"
soap <- read_delim(url," ")
soap
```
## Scatterplot
```{r bootstrap-18, fig.height=3.75}
ggplot(soap, aes(x=speed, y=scrap, colour=line))+
geom_point()+geom_smooth(method="lm", se=F)
```
## Comments
- Line B produces less scrap for any given speed.
- For line B, estimate the correlation between speed and scrap (with a confidence interval.)
## Extract the line B data; standard correlation test xxx
```{r bootstrap-19}
soap %>% filter(line=="b") -> line_b
with(line_b, cor.test(speed, scrap))
```
```{r bootstrap-20, include=FALSE}
o_c=with(line_b, cor.test(speed, scrap))$conf.int
```
## Bootstrapping a correlation 1/2
- This illustrates a different technique: we need to keep the $x$ and $y$ values *together*.
- Sample *rows* of the data frame rather than individual values of `speed` and `scrap`:
\scriptsize
```{r bootstrap-21}
line_b %>% sample_frac(replace=T)
```
\normalsize
## Bootstrapping a correlation 2/2
1000 times:
```{r bootstrap-22}
tibble(sim = 1:1000) %>%
rowwise() %>%
mutate(boot_df = list(sample_frac(line_b, replace = TRUE))) %>%
mutate(my_cor = with(boot_df, cor(speed, scrap))) -> cors
```
## A picture of this
```{r bootstrap-23, fig.height=4}
ggplot(cors, aes(x=my_cor))+geom_histogram(bins=15)
```
## Comments and next steps
- This is very left-skewed.
- Bootstrap percentile interval is:
```{r bootstrap-24}
(b_p=quantile(cors$my_cor, c(0.025, 0.975)))
```
- We probably need the BCA interval instead.
## Getting the BCA interval 1/2
- To use `bcanon`, write a function that takes a vector of row numbers and returns the correlation between `speed` and `scrap` for those rows:
\footnotesize
```{r bootstrap-25}
theta <- function(rows, d) {
d %>% slice(rows) %>% with(., cor(speed, scrap))
}
theta(1:3, line_b)
line_b %>% slice(1:3)
```
\normalsize
- That looks about right.
## Getting the BCA interval 2/2
- Inputs to `bcanon` are now:
- row numbers (1 through 12 in our case: 12 rows in `line_b`)
- number of bootstrap samples
- the function we just wrote
- the data frame:
```{r bootstrap-26}
points=bcanon(1:12, 1000, theta, line_b)$confpoints
points %>% as_tibble() %>%
filter(alpha %in% c(0.025, 0.975)) %>%
pull(`bca point`) -> b_bca
b_bca
```
## Comparing the results
```{r bootstrap-27}
tibble(limit=my_names, o_c, b_p, b_bca)
```
- The bootstrap percentile interval doesn't go down far enough.
- The BCA interval seems to do a better job in capturing the skewness of the distribution.
- The ordinary confidence interval for the correlation is very similar to the BCA one, and thus seems to be trustworthy here even though the correlation has a very skewed distribution. (`cor.test` uses the Fisher $z$ transformation which "spreads out" correlations close to 1).
## The $z$-transformed bootstrapped correlations
\small
```{r bootstrap-28}
cors %>%
mutate(z = 0.5 * log((1+my_cor)/(1-my_cor))) %>%
ggplot(aes(sample=z)) + stat_qq() + stat_qq_line()
```
\normalsize