---
title: "Power of hypothesis tests"
editor:
markdown:
wrap: 72
---
## Packages
```{r}
#| message: false
library(tidyverse)
```
## Errors in testing
What can happen (rows: the truth; columns: the decision):

| **Truth**      | **Do not reject** | **Reject null** |
|:---------------|:------------------|:----------------|
| **Null true**  | Correct           | Type I error    |
| **Null false** | Type II error     | Correct         |
There is a tension between the truth and our decision about the truth: the decision is based on (imperfect) data, so it can be wrong.
- Prob. of type I error denoted $\alpha$. Usually fix $\alpha$, eg.
$\alpha = 0.05$.
- Prob. of type II error denoted $\beta$. Determined by the planned
experiment. Low $\beta$ good.
- Prob. of not making type II error called **power** (= $1 - \beta$).
*High* power good.
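## Aside: seeing $\alpha$ by simulation
- A quick sketch (the settings here are illustrative, not from any study): if the null is actually *true*, a test at $\alpha = 0.05$ should reject about 5% of the time, by definition of $\alpha$:
```{r}
# sketch: simulate data where H0 (mu = 10) is actually TRUE,
# and count how often a t-test wrongly rejects (type I errors)
library(tidyverse)
set.seed(1)
tibble(sim = 1:1000) %>%
  rowwise() %>%
  mutate(my_sample = list(rnorm(15, 10, 4))) %>%  # true mean really is 10
  mutate(p_val = t.test(my_sample, mu = 10)$p.value) %>%
  ungroup() %>%
  summarize(type_I_rate = mean(p_val <= 0.05))  # should be near 0.05
```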
## Power 1/2
- Suppose $H_0 : \theta = 10$, $H_a : \theta \ne 10$ for some
parameter $\theta$.
- Suppose $H_0$ wrong. What does that say about $\theta$?
- Not much. Could have $\theta = 11$ or $\theta = 8$ or
$\theta = 496$. In each case, $H_0$ wrong.
## Power 2/2
- How likely a type II error is depends on what $\theta$ is:
- If $\theta = 496$, should reject $H_0 : \theta = 10$
even for small sample, so $\beta$ small (power large).
- If $\theta = 11$, hard to reject $H_0$ unless the sample is
very large, so $\beta$ would be larger (power smaller).
- Power depends on true parameter value, and on sample size.
- So we play "what if": "if $\theta$ were 11 (or 8 or 496), what would
power be?".
## Figuring out power
- The time to figure out power is *before* you collect any data, as
part of the planning process.
- Need an idea of what kind of departure from the null hypothesis is
of interest to you, eg. an average improvement of 5 points on reading
test scores. (A subject-matter decision, not a statistical one.)
- Then, either:
- "I have this big a sample and this big a departure I want to
detect. What is my power for detecting it?"
- "I want to detect this big a departure with this much power. How
big a sample size do I need?"
## How to understand/estimate power?
- Suppose we test $H_0 : \mu = 10$ against $H_a : \mu \ne 10$, where
$\mu$ is population mean.
- Suppose in actual fact, $\mu = 8$, so $H_0$ is wrong. We want to
reject it. How likely is that to happen?
- Need population SD (take $\sigma = 4$) and sample size (take
$n = 15$). In practice, get $\sigma$ from pilot/previous study, and
take the $n$ we plan to use.
- Idea: draw a random sample from the true distribution, test whether
its mean is 10 or not.
- Repeat previous step "many" times: simulation.
## Making it go
- Random sample of 15 normal observations with mean 8 and SD 4:
```{r}
#| echo: false
set.seed(457299)
```
```{r}
x <- rnorm(15, 8, 4)
x
```
- Test whether `x` from population with mean 10 or not (over):
## ...continued
```{r}
t.test(x, mu = 10)
```
- P-value 0.081, so we fail to reject that the mean is 10. But the true mean is 8, so this is a type II error.
## or get just P-value
```{r}
ans <- t.test(x, mu = 10)
ans$p.value
```
## Run this lots of times via simulation
- draw random samples from the truth
- test that $\mu = 10$
- get P-value
- Count up how many of the P-values are 0.05 or less.
## In code
```{r inference-2-R-5, echo=FALSE}
set.seed(457299)
```
```{r inference-2-R-6}
tibble(sim = 1:1000) %>%
  rowwise() %>%
  mutate(my_sample = list(rnorm(15, 8, 4))) %>%
  mutate(t_test = list(t.test(my_sample, mu = 10))) %>%
  mutate(p_val = t_test$p.value) %>%
  count(p_val <= 0.05)
```
We correctly rejected 422 times out of 1000, so the estimated power is
0.422.
## Try again with bigger sample
```{r inference-2-R-6a}
tibble(sim = 1:1000) %>%
  rowwise() %>%
  mutate(my_sample = list(rnorm(40, 8, 4))) %>%
  mutate(t_test = list(t.test(my_sample, mu = 10))) %>%
  mutate(p_val = t_test$p.value) %>%
  count(p_val <= 0.05)
```
Power is (much) larger with a bigger sample.
## How accurate is my simulation?
- At our chosen $\alpha$, each simulated test independently rejects (or
  not) with some probability $p$: the power we are trying to estimate.
- So we are estimating a population probability using a sample
  proportion (the number of simulated rejections out of the number of
  simulated tests).
- Hence, `prop.test`.
- Inputs: number of rejections, number of simulations.
## Sample size 15, rejected 422 times
```{r}
prop.test(422, 1000)
```
95% CI for power: 0.391 to 0.453
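As a rough check (my arithmetic, not part of the `prop.test` output), the usual normal-approximation interval $\hat{p} \pm 1.96\sqrt{\hat{p}(1-\hat{p})/n}$ gives nearly the same answer; `prop.test` adds a continuity correction, so it differs slightly:
```{r}
# rough by-hand check of the prop.test interval
p_hat <- 422 / 1000
se <- sqrt(p_hat * (1 - p_hat) / 1000)  # SE of a sample proportion
p_hat + c(-1, 1) * 1.96 * se  # close to 0.391 to 0.453
```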
## To estimate power more accurately
- Run more *simulations*: change 1000 to eg. 10,000:
```{r}
tibble(sim = 1:10000) %>%
  rowwise() %>%
  mutate(my_sample = list(rnorm(15, 8, 4))) %>%
  mutate(t_test = list(t.test(my_sample, mu = 10))) %>%
  mutate(p_val = t_test$p.value) %>%
  count(p_val <= 0.05)
```
## Accuracy of power now
```{r}
prop.test(4353, 10000)
```
95% CI for power: 0.426 to 0.445, shorter by about a factor of $\sqrt{10}$, because the number of simulations is 10 times bigger.
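A quick arithmetic check (mine) that the interval widths really do shrink by about that factor:
```{r}
# widths of the two confidence intervals for the power
width_1k <- 0.453 - 0.391   # from 1000 simulations
width_10k <- 0.445 - 0.426  # from 10,000 simulations
c(ratio = width_1k / width_10k, sqrt_10 = sqrt(10))  # ratio close to sqrt(10)
```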
## Calculating power
- Simulation approach very flexible: will work for any test. But
answer different each time because of randomness.
- In some cases, for example 1-sample and 2-sample t-tests, power can
be calculated.
- `power.t.test`. Input `delta` is difference between null and true
mean:
```{r inference-2-R-7}
power.t.test(n = 15, delta = 10-8, sd = 4, type = "one.sample")
```
## Comparison of results
| Method | Power |
|:-------------------|:-------|
| Simulation (10000) | 0.4353 |
| **`power.t.test`** | 0.4378 |
- Simulation power is similar to calculated power; to get more
accurate value, repeat more times (eg. 100,000 instead of 10,000),
which takes longer.
- With this small a sample size, the power is not great. With a bigger
sample, the sample mean should be closer to 8 most of the time, so
would reject $H_0 : \mu = 10$ more often.
## Calculating required sample size
- Often, when planning a study, we do not have a particular sample
size in mind. Rather, we want to know how big a sample to take. This
can be done by asking how big a sample is needed to achieve a
certain power.
- The simulation approach does not work naturally with this, since you
have to supply a sample size.
- For that, you try different sample sizes until you get power
close to what you want.
- For the power-calculation method, you supply a value for the power,
but leave the sample size missing.
- Re-use the same problem: $H_0 : \mu = 10$ against 2-sided
alternative, true $\mu = 8$, $\sigma = 4$, but now aim for power
0.80.
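## Aside: sample size search by simulation
- A sketch of that trial-and-error search (same assumed truth as before: $\mu = 8$, $\sigma = 4$, testing $H_0 : \mu = 10$): estimate the power by simulation at several candidate sample sizes and see where it reaches 0.80:
```{r}
library(tidyverse)
# estimated power for one sample size, by simulation
estimate_power <- function(n, nsim = 1000) {
  tibble(sim = 1:nsim) %>%
    rowwise() %>%
    mutate(my_sample = list(rnorm(n, 8, 4))) %>%
    mutate(p_val = t.test(my_sample, mu = 10)$p.value) %>%
    ungroup() %>%
    summarize(power = mean(p_val <= 0.05)) %>%
    pull(power)
}
set.seed(457299)
tibble(n = c(20, 30, 40)) %>%
  rowwise() %>%
  mutate(power = estimate_power(n))
```
- The estimated power first passes 0.80 somewhere in the 30s, consistent with the calculation on the next slide.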
## Using power.t.test
- No `n=`, replaced by a `power=`:
```{r inference-2-R-8}
power.t.test(power=0.80, delta=10-8, sd=4, type="one.sample")
```
- Sample size must be a whole number, so round up to 34 (to get at
least as much power as you want).
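- A quick check of the rounding (an extra calculation, same settings as above): $n = 33$ falls just short of the target power, while $n = 34$ reaches it:
```{r}
# power just below and just above the calculated sample size
power.t.test(n = 33, delta = 10 - 8, sd = 4, type = "one.sample")$power
power.t.test(n = 34, delta = 10 - 8, sd = 4, type = "one.sample")$power  # at least 0.80
```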
## Power curves
- Rather than calculating power for one sample size, or sample size
for one power, might want a picture of relationship between sample
size and power.
- Or, likewise, picture of relationship between difference between
true and null-hypothesis means and power.
- Called power curve.
- Build and plot it yourself.
## Building it:
```{r inference-2-R-14}
tibble(n = seq(10, 100, 10)) %>%
  rowwise() %>%
  mutate(power_output =
           list(power.t.test(n = n, delta = 10 - 8, sd = 4,
                             type = "one.sample"))) %>%
  mutate(power = power_output$power) %>%
  ggplot(aes(x = n, y = power)) + geom_point() + geom_line() +
  geom_hline(yintercept = 1, linetype = "dashed") -> g2
```
## The power curve
```{r inference-2-R-15}
g2
```
## Power curves for means
- Can also investigate power as it depends on what the true mean is
(the farther from null mean 10, the higher the power will be).
- Investigate for two different sample sizes, 15 and 30.
- First make all combos of mean and sample size:
```{r inference-2-R-16}
means <- seq(6,10,0.5)
ns <- c(15,30)
combos <- crossing(mean=means, n=ns)
```
## The combos
\scriptsize
```{r inference-2-R-17}
combos
```
\normalsize
## Calculate powers:
```{r inference-2-R-18}
combos %>%
  rowwise() %>%
  mutate(power_stuff = list(power.t.test(n = n, delta = 10 - mean, sd = 4,
                                         type = "one.sample"))) %>%
  mutate(power = power_stuff$power) -> powers
```
## then make the plot:
```{r inference-2-R-21}
g <- ggplot(powers, aes(x = mean, y = power, colour = factor(n))) +
  geom_point() + geom_line() +
  geom_hline(yintercept = 1, linetype = "dashed") +
  geom_vline(xintercept = 10, linetype = "dotted")
```
- Need `n` as categorical so that `colour` works properly.
## The power curves
```{r inference-2-R-22, fig.height=3.8}
g
```
## Comments
- When `mean=10`, that is, the true mean equals the null mean, $H_0$
is actually true, and the probability of rejecting it then is
$\alpha = 0.05$.
- As the null gets more wrong (mean decreases), it becomes easier to
correctly reject it.
- The blue power curve is above the red one for any mean \< 10,
meaning that no matter how wrong $H_0$ is, you always have a greater
chance of correctly rejecting it with a larger sample size.
- Previously, we had $H_0 : \mu = 10$ and a true $\mu = 8$: reading the
  graph at a mean of 8 gives power of about 0.42 for $n = 15$ and about
  0.75 for $n = 30$.
- With $n = 30$, a true mean that is less than about 7 is almost
certain to be correctly rejected. (With $n = 15$, the true mean
needs to be less than 6.)
## Two-sample power
```{r inference-2-R-25, echo=FALSE}
#| message: false
my_url <- "http://ritsokiguess.site/datafiles/drp.txt"
kids <- read_delim(my_url, " ")
```
- For kids learning to read, had sample sizes of 22 (approx) in each
group
- and these group SDs:
```{r inference-2-R-26}
kids %>%
  group_by(group) %>%
  summarize(n = n(), s = sd(score))
```
## Setting up
- suppose a 5-point improvement in reading score was considered
important (on this scale)
- in a 2-sample test, null (difference of) mean is zero, so `delta` is
true difference in means
- what is power for these sample sizes, and what sample size would be
needed to get power up to 0.80?
- `power.t.test` requires the SD to be the same in both groups, so take it as 14.
## Calculating power for sample size 22 (per group)
```{r pow1}
power.t.test(n = 22, delta = 5, sd = 14, type = "two.sample",
             alternative = "one.sided")
```
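## Checking the two-sample power by simulation
- As a cross-check (a sketch, not part of the original analysis), the same power can be estimated by simulation: draw two samples of 22 with true means 5 apart, both with SD 14, and run a pooled one-sided $t$-test (`var.equal = TRUE`, to match the equal-variance assumption of `power.t.test`):
```{r}
library(tidyverse)
set.seed(457299)
tibble(sim = 1:1000) %>%
  rowwise() %>%
  mutate(sample1 = list(rnorm(22, 0, 14)),
         sample2 = list(rnorm(22, 5, 14))) %>%  # true means 5 apart
  mutate(p_val = t.test(sample2, sample1, alternative = "greater",
                        var.equal = TRUE)$p.value) %>%
  ungroup() %>%
  summarize(power = mean(p_val <= 0.05))
```
- The estimated power should come out close to the `power.t.test` value.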
## sample size for power 0.8
```{r pow2}
power.t.test(power = 0.80, delta = 5, sd = 14, type = "two.sample",
             alternative = "one.sided")
```
## Comments
- The power for the sample sizes we have is very small (to detect a
5-point increase).
- To get power 0.80, we need 98 kids in *each* group!
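- A quick check (an extra calculation): with 98 kids per group, the calculated power comes out at about 0.80, as required:
```{r}
power.t.test(n = 98, delta = 5, sd = 14, type = "two.sample",
             alternative = "one.sided")$power  # about 0.80
```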