Recall the IRS data (used as motivation for the sign test):
ggplot(irs, aes(x = Time)) + geom_histogram(bins = 10)
A \(t\) procedure for the mean would not be a good idea because the distribution is skewed.
What actually matters
It’s not the distribution of the data that has to be approx normal (for a \(t\) procedure).
What matters is the sampling distribution of the sample mean.
If the sample size is large enough, the sampling distribution will be normal enough even if the data distribution is not.
This is why we had to consider the sample size as well as the shape.
But how do we know whether this is the case or not? We only have one sample.
The (nonparametric) bootstrap
Typically, our sample will be reasonably representative of the population.
Idea: pretend the sample is the population, and sample from it with replacement.
Calculate test statistic, and repeat many times.
This gives an idea of how our statistic might vary in repeated samples: that is, its sampling distribution.
Called the bootstrap distribution of the test statistic.
If the bootstrap distribution is approx normal, infer that the true sampling distribution is also approx normal, and therefore that inference about the mean, such as the \(t\) procedure, is good enough.
If not, we should be more careful.
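The procedure can be sketched in a few lines of base R; here `x` stands for any numeric sample, and 1000 resamples is an arbitrary choice:

```r
# nonparametric bootstrap sketch (assumes x is a numeric vector):
# resample with replacement, compute the statistic, repeat many times
boot_means <- replicate(1000, mean(sample(x, replace = TRUE)))
# the distribution of boot_means estimates the sampling distribution
hist(boot_means)
```

If the histogram looks approx normal, the \(t\) procedure is defensible; if not, be more careful.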
Why it works
We typically estimate population parameters using the corresponding sample quantity: e.g., estimate the population mean using the sample mean.
This is called the plug-in principle.
The fraction of sample values less than or equal to a value \(x\) is called the empirical distribution function (as a function of \(x\)).
By the plug-in principle, the empirical distribution function is an estimate of the population CDF.
In this sense, the sample is an estimate of the population, and so sampling from it is an estimate of sampling from the population.
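As a small illustration (with made-up values), R's `ecdf` computes the empirical distribution function:

```r
# empirical CDF of a small hypothetical sample
x <- c(2, 5, 7, 7, 10)
F_hat <- ecdf(x)
F_hat(7)  # fraction of values <= 7, here 4/5 = 0.8
```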
Bootstrapping the IRS data
Sampling with replacement is done like this (the default sample size is as long as the original data):
boot <- sample(irs$Time, replace = TRUE)
mean(boot)
[1] 216.2
That’s one bootstrapped mean. We need a whole bunch.
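One way to get a whole bunch, in tidyverse style, assuming the `irs` data frame is loaded (1000 simulations is an arbitrary choice):

```r
# 1000 bootstrapped means of Time, one per row
tibble(sim = 1:1000) %>%
  rowwise() %>%
  mutate(my_mean = mean(sample(irs$Time, replace = TRUE))) %>%
  ggplot(aes(x = my_mean)) + geom_histogram(bins = 10)
```

The shape of this histogram indicates whether the sampling distribution of the sample mean is normal enough.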
Another example: data on speed and scrap for two lines (a and b):
# A tibble: 27 × 4
case scrap speed line
<dbl> <dbl> <dbl> <chr>
1 1 218 100 a
2 2 248 125 a
3 3 360 220 a
4 4 351 205 a
5 5 470 300 a
6 6 394 255 a
7 7 332 225 a
8 8 321 175 a
9 9 410 270 a
10 10 260 170 a
# ℹ 17 more rows
The ordinary correlation test:
Pearson's product-moment correlation
data: speed and scrap
t = 15.829, df = 10, p-value = 2.083e-08
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
0.9302445 0.9947166
sample estimates:
cor
0.9806224
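For reference, output like the above would come from a call along these lines (df = 10 means 12 observations, so this is presumably the test for one line's rows, here assumed to be in `line_b`):

```r
# hypothetical call reproducing the test output above
with(line_b, cor.test(speed, scrap))
```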
Bootstrapping a correlation 1/2
This illustrates a different technique: we need to keep the \(x\) and \(y\) values together.
Sample rows of the data frame rather than individual values of speed and scrap:
line_b %>% sample_frac(replace = TRUE)
# A tibble: 12 × 4
case scrap speed line
<dbl> <dbl> <dbl> <chr>
1 24 252 155 b
2 22 260 200 b
3 16 140 105 b
4 25 422 320 b
5 16 140 105 b
6 19 341 255 b
7 19 341 255 b
8 19 341 255 b
9 17 277 215 b
10 16 140 105 b
11 20 215 175 b
12 18 384 270 b
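Repeating the resampling many times and computing the correlation each time gives the bootstrap distribution; a sketch consistent with the `cors` and `my_cor` names used later:

```r
# 1000 bootstrapped correlations from resampled rows of line_b
cors <- tibble(sim = 1:1000) %>%
  rowwise() %>%
  mutate(my_cor = with(sample_frac(line_b, replace = TRUE),
                       cor(speed, scrap))) %>%
  ungroup()
ggplot(cors, aes(x = my_cor)) + geom_histogram(bins = 10)
```

Percentile or BCA intervals can then be computed from the `my_cor` column.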
The bootstrap percentile interval doesn’t go down far enough.
The BCA interval seems to do a better job in capturing the skewness of the distribution.
The ordinary confidence interval for the correlation is very similar to the BCA one, and thus seems to be trustworthy here even though the correlation has a very skewed distribution. (cor.test uses the Fisher \(z\) transformation which “spreads out” correlations close to 1).
The \(z\)-transformed bootstrapped correlations
cors %>%
  mutate(z = 0.5 * log((1 + my_cor) / (1 - my_cor))) %>%
  ggplot(aes(sample = z)) + stat_qq() + stat_qq_line()
Comments