STAC32 Assignment 4

You are expected to complete this assignment on your own: that is, you may discuss general ideas with others, but the writeup of the work must be entirely your own. If your assignment is unreasonably similar to that of another student, you can expect to be asked to explain yourself.

If you run into problems on this assignment, it is up to you to figure out what to do. The only exception is if it is impossible for you to complete this assignment, for example a data file cannot be read. (There is a difference between you not knowing how to do something, which you have to figure out, and something being impossible, which you are allowed to contact me about.)

You must hand in a rendered document that shows your code, the output that the code produces, and your answers to the questions. This should be a file with .html on the end of its name. There is no credit for handing in your unrendered document (ending in .qmd), because the grader cannot then see whether the code in it runs properly. After you have handed in your file, you should be able to see (in Attempts) what file you handed in, and you should make a habit of checking that you did indeed hand in what you intended to, and that it displays as you expect.

Hint: render your document frequently, and solve any problems as they come up, rather than trying to do so at the end (when you may be close to the due date). If your document will not successfully render, it is because of an error in your code that you will have to find and fix. The error message will tell you where the problem is, but it is up to you to sort out what the problem is.

Power of a test

In a certain study, you believe that observations come from a normal distribution with mean 110 and SD 15, and you are interested in how likely you are to be able to reject a null hypothesis that the mean is 100, in favour of a one-sided alternative that the mean is greater than 100.

  1. (3 points) By simulation, estimate the probability that you will be able to correctly reject the given null hypothesis in favour of the given alternative, based on a sample size of 25.

The probability that a null hypothesis is correctly rejected when the truth is something else (at least as believed by you) is the power of the test, so that is what you need to simulate.

The ingredients are:

  • the true population has a normal distribution with mean 110 and SD 15
  • the null hypothesis is that the mean is 100 (and the alternative is “greater”)
  • the sample size is 25.

Then, draw repeated samples of size 25 from the truth, run the \(t\)-test each time, obtain the P-values, and count how many of them are less than our presumed \(\alpha\) of 0.05.

Follow the procedure on the slide “in code” from the lecture notes, substituting the appropriate numbers. The difference from lecture is that the test is one-sided:

tibble(sim = 1:1000) %>% 
  rowwise() %>% 
  mutate(my_sample = list(rnorm(25, 110, 15))) %>% 
  mutate(t_test = list(t.test(my_sample, mu = 100, alternative = "greater"))) %>% 
  mutate(p_val = t_test$p.value) %>% 
  count(p_val <= 0.05)

Check carefully that your test is also one-sided.

My estimated power is 0.942 (which you need to say explicitly: don’t expect your reader to infer it from your output).

Your estimated power may be a bit different from mine, because of randomness. See Extra 2 in the next question for how far off you might be from the exact power, which you are about to calculate.

  2. (2 points) Repeat the power calculation using power.t.test. Do you get about the same answer? (Hint: use a positive value for delta.)

This is a matter of getting the right numbers in the right places, and of using the right alternative to get a one-sided test:

power.t.test(n = 25, delta = 110 - 100, sd = 15, type = "one.sample", alternative = "one.sided")

     One-sample t test power calculation 

              n = 25
          delta = 10
             sd = 15
      sig.level = 0.05
          power = 0.9443429
    alternative = one.sided

My power is calculated to be 0.944, which is actually very close to my simulated power of 0.942. (I would expect you to be less lucky than I was, though you should still be fairly close, because the estimation and calculation are of the same answer.)

This is an exact calculation, so here you should get exactly the same answer as I do.

Extra 1: what happens if you have the numbers the other way around in delta, so that the delta input is negative?

power.t.test(n = 25, delta = 100 - 110, sd = 15, type = "one.sample", alternative = "one.sided")

     One-sample t test power calculation 

              n = 25
          delta = -10
             sd = 15
      sig.level = 0.05
          power = 5.066378e-07
    alternative = one.sided

The power is now almost zero. What power.t.test actually does for a one-sided test is to use the alternative “greater” if your delta is positive, and “less” if your delta is negative,1 so what happened here is that you got the power of the test when the null is 100 and the true mean is 110, but with an alternative of “less”. If the true mean is actually greater than 100, you are almost never going to reject against this alternative, hence the result you see.

This is also the reason for the hint I gave you in the question.
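
If you want to see the same thing by simulation (purely for interest), rerun the earlier simulation but with the alternative pointing the wrong way; you should find that essentially none of the 1000 tests reject:

# samples still come from the true N(110, 15), but now we test against "less"
tibble(sim = 1:1000) %>% 
  rowwise() %>% 
  mutate(my_sample = list(rnorm(25, 110, 15))) %>% 
  mutate(t_test = list(t.test(my_sample, mu = 100, alternative = "less"))) %>% 
  mutate(p_val = t_test$p.value) %>% 
  count(p_val <= 0.05)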

Extra 2: You can figure out how close you can expect to be to the exact answer in your simulation. The procedure is a test (or, precisely, a confidence interval) for a population proportion. You might have seen this test in the context of an opinion poll, where the people you sample either agree or disagree with the statement made in the poll, or maybe in the context of testing that a coin or a die2 is fair (it gives the right percentage of heads when tossed or 6s when rolled, for example). The common theme to these situations is that you have something you call a “success” (“agree” or “heads” or “6” from each person or coin toss or die roll) and you are interested in the probability of success, which you estimate from the proportion of successes you observe.

Estimating power is just like that: each time we generate some data from the truth and run a test on it, we have a “success” if we correctly reject the null hypothesis, and the actual true power is the probability of this “success”.

R has prop.test that is like t.test but for proportions. The input is the number of successes you observed, followed by the number of attempts you made. You can test a null-hypothesis value of the success probability as something like p = 0.5 (which is the sort of thing you would use if you wanted to see whether you had a fair coin), but here we’re only going to look at the 95% confidence interval:

prop.test(942, 1000)

    1-sample proportions test with continuity correction

data:  942 out of 1000, null probability 0.5
X-squared = 779.69, df = 1, p-value < 2.2e-16
alternative hypothesis: true p is not equal to 0.5
95 percent confidence interval:
 0.9251946 0.9553068
sample estimates:
    p 
0.942 

Based on my simulation of 942 rejections out of 1000 (you should use your values in your version of this), I think the true power is between 0.925 and 0.955, with 95% confidence. This clearly includes (for me) the exact answer of 0.944.

I think I said in lecture that in estimating a proportion with 1000 simulations, we can expect to have a margin of error around 3 percentage points. But that only applies if the true power is around 0.50, which it very much is not here. When the power is close to 1, the confidence interval for the true power is shorter, because at the high end there is nowhere much for it to go (it makes no sense for the upper limit to be greater than 1).
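
To see the effect, you could compare the interval you would get had your simulation landed near 0.5 with the one above (the 500 and 942 successes out of 1000 here are just for illustration):

# margin of error when the estimated proportion is near 0.5: about 3 points each way
prop.test(500, 1000)$conf.int
# margin of error when the estimated proportion is near 1: noticeably narrower
prop.test(942, 1000)$conf.int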

  3. (2 points) The researchers decided that a power of 0.80 would be high enough. Run one more simulation that will provide some information about how large a sample size they would need. I am not asking you to find the answer, merely to be closer to it than your first simulation was.

This time, we are in the happy situation of having too much power, and so a smaller sample size will be sufficient. The simulation approach does not enable you to work out the sample size for a given power directly, so what you do is try a smaller sample size (anything smaller than 25 will do) and see what happens. Make sure you say why you are choosing a sample size smaller than 25.

I decided to try \(n = 10\). Copy and paste your code from above, and change the sample size (you would do well to describe what you are doing so that your reader can follow):

tibble(sim = 1:1000) %>% 
  rowwise() %>% 
  mutate(my_sample = list(rnorm(10, 110, 15))) %>% 
  mutate(t_test = list(t.test(my_sample, mu = 100, alternative = "greater"))) %>% 
  mutate(p_val = t_test$p.value) %>% 
  count(p_val <= 0.05)

As I have to admit I suspected, this is too small a sample size: the power is now only 0.626. For you, any sample size smaller than \(n = 25\) is good here, since that gives you a power less than what you had before. In my case, the researchers now know that a sample size between 10 and 25 is what is needed, and I can try again to pin things down more precisely.

I doubt you did something like this, reducing the sample size only to 23:

tibble(sim = 1:1000) %>% 
  rowwise() %>% 
  mutate(my_sample = list(rnorm(23, 110, 15))) %>% 
  mutate(t_test = list(t.test(my_sample, mu = 100, alternative = "greater"))) %>% 
  mutate(p_val = t_test$p.value) %>% 
  count(p_val <= 0.05)

but the power in this case is still too big, and here you would say that all you know is that the sample size should be smaller than 23 as well.

I only want you to do one more simulation, and it doesn’t matter how close it gets you to the right answer, only that it gets you closer than you were before. Reuse your code from the previous one (now that you have done it once, you can see that this kind of trial and error only requires another paste and a try at a different sample size). I do not want you to use power.t.test here: if that’s what I wanted, I would have said “calculation” rather than “simulation”.

Extra: I tried to be specific about what to do, so that you didn’t do too much work. But you can feel free, as a bonus for yourself, to see how close you can get by trying again. Copy, paste, and tinker with the sample size. Given what I did before, it looks as if 15 should be pretty close to the right answer:

tibble(sim = 1:1000) %>% 
  rowwise() %>% 
  mutate(my_sample = list(rnorm(15, 110, 15))) %>% 
  mutate(t_test = list(t.test(my_sample, mu = 100, alternative = "greater"))) %>% 
  mutate(p_val = t_test$p.value) %>% 
  count(p_val <= 0.05)

That is actually very close to the target of 0.80. In fact, I was lucky to get as close as I did: based on 1000 simulations, here is how far off I could expect to be:

prop.test(807, 1000)

    1-sample proportions test with continuity correction

data:  807 out of 1000, null probability 0.5
X-squared = 375.77, df = 1, p-value < 2.2e-16
alternative hypothesis: true p is not equal to 0.5
95 percent confidence interval:
 0.7808614 0.8307411
sample estimates:
    p 
0.807 

The true power could be anything between 0.78 and 0.83 based on these results. When you are close to the right answer like this, what you can do is to try again with more simulations:

tibble(sim = 1:10000) %>% 
  rowwise() %>% 
  mutate(my_sample = list(rnorm(15, 110, 15))) %>% 
  mutate(t_test = list(t.test(my_sample, mu = 100, alternative = "greater"))) %>% 
  mutate(p_val = t_test$p.value) %>% 
  count(p_val <= 0.05)

and this more accurate result suggests that a sample size of 15 is a little too small.
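
If you find yourself doing a lot of this trial and error, another option (not something the question asks for) is to wrap the simulation in a function of the sample size, so that each new trial is a one-line call. A sketch, where sim_power and its inputs are names I made up:

# estimate the power of the one-sided t-test for a candidate sample size
sim_power <- function(sample_size, n_sim = 1000) {
  tibble(sim = 1:n_sim) %>% 
    rowwise() %>% 
    mutate(my_sample = list(rnorm(sample_size, 110, 15))) %>% 
    mutate(t_test = list(t.test(my_sample, mu = 100, alternative = "greater"))) %>% 
    mutate(p_val = t_test$p.value) %>% 
    ungroup() %>% 
    summarize(power = mean(p_val <= 0.05)) %>% 
    pull(power)
}
sim_power(16)   # try whatever candidate sample size you like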

  4. (3 points) Use power.t.test to calculate the sample size that you need for a power of 0.80. Is your result consistent with your simulations? Explain briefly.

Input the desired power along with everything else but not including n, and the sample size will be calculated:

power.t.test(power = 0.80, delta = 110 - 100, sd = 15, type = "one.sample", alternative = "one.sided")

     One-sample t test power calculation 

              n = 15.35763
          delta = 10
             sd = 15
      sig.level = 0.05
          power = 0.8
    alternative = one.sided

The required sample size is actually 16 (round up). Two points.

My various simulations said that a sample size of 25 is too big, 10 is too small, and 15 was close to the mark (maybe a little too small), and those are entirely consistent with \(n = 16\) being the right answer. The third point. (You will have two simulations to discuss, not three.)
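
If you would like R to do the rounding up for you (a small convenience, not required), save the power.t.test output in something (I am calling it ans80) and apply ceiling to its n component:

ans80 <- power.t.test(power = 0.80, delta = 110 - 100, sd = 15,
                      type = "one.sample", alternative = "one.sided")
# round the fractional sample size up to the next whole number
ceiling(ans80$n)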

  5. (3 points) Draw a power curve of the power in this situation, for sample sizes from 5 to 40 in steps of 5.

Follow the steps from the lecture notes:

  • set up the sample sizes you want (0.5 points)
  • work out the power for each of those (1 point)
  • construct a dataframe out of the above (0.5 points)
  • plot the points joined by lines (1 point).

That goes like this:

ns <- seq(5, 40, 5)
ns
[1]  5 10 15 20 25 30 35 40
ans <- power.t.test(n = ns, delta = 110 - 100, sd = 15, type = "one.sample", alternative = "one.sided")

For the last line of code above, copy and paste your first power.t.test, the one where you were given the sample size and had to find out the power, and replace n = 25 with n = ns (or whatever you called your vector of candidate sample sizes).

Now, construct a dataframe of the sample sizes and powers, and plot it according to the specifications:

d <- tibble(n = ns, power = ans$power)
ggplot(d, aes(x = n, y = power)) + geom_point() + geom_line()

That is fine as an answer, but you can add a horizontal line at 1, as I did in the lecture notes, if you like. You can copy the piece of code from the lecture notes:

ggplot(d, aes(x = n, y = power)) + geom_point() + geom_line() +
  geom_hline(yintercept = 1, linetype = "dashed")

Be careful about copying too literally: if you save the graph in g and then display g, it looks as if you are copying without thinking. The reason I did that in the lecture notes is that I needed the code on one slide and the graph on the next one (there wasn’t room for both on one slide). But you have no such restrictions; you can and should display it right away.

Extra: you could also find the power for each sample size via a more tidyverse approach, copying from my lecture notes and editing:

tibble(n=ns) %>% rowwise() %>%
  mutate(power_output = 
           list(power.t.test(n = n, delta = 110 - 100, sd = 15, 
                             type = "one.sample",
                             alternative = "one.sided"))) %>% 
  mutate(power = power_output$power) %>% 
  ggplot(aes(x=n, y=power)) + geom_point() + geom_line() +
    geom_hline(yintercept=1, linetype="dashed")

This is also good. It is more in the spirit of “set up a dataframe with what you want, work rowwise, and then calculate stuff” that we have seen before, but if you prefer the other way, that is fine. (There is a lot of output from power.t.test, hence the list, and one of the things in there is power, which is what we actually want.)

  6. (1 point) According to your graph, about what sample size would be needed to achieve a power of 0.90? (No explanation needed.)

Read across the power = 0.9 line (the unmarked one between 0.8 and 1.0) to see that a sample size of 20 won’t quite do it. The trend seems to cross the 0.9 power line at about \(n = 21\) or \(n = 22\). Either of these will do as an answer. (I think 22 is too big, but it is not clear to me whether the trend crosses the 0.9 line above or below 21, so the fractional sample size might come out as 21 point something, and the rounded-up sample size might be 22.)

Extra: this last is also consistent with my other work here. By simulation, a sample size of 25 gave power bigger than 0.90 (which you also see on your graph), so the required sample size should be (a bit) smaller than 25, but larger than the \(n=16\) that got us a power of 0.80.
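
If you wanted to check your graph-reading (not required here, since the question says “according to your graph”), you could also ask power.t.test directly for the sample size that gives power 0.90 and compare it with what you read off:

power.t.test(power = 0.90, delta = 110 - 100, sd = 15,
             type = "one.sample", alternative = "one.sided")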

A comparative experiment

You will be running a randomized comparative experiment (that is to say, comparing a treatment with a control). If all goes well, you expect to see a mean value of your response y in the treatment group of 45 and a mean y in the control group of 40. Previous studies have shown that y is likely to have a standard deviation of 10 in each group. You plan to use sample sizes of 30 in each group, and are curious about the power to expect from a one-sided test.

  1. (4 points) Use power.t.test to calculate the power of your test, under the assumption that both treatment and control populations have a normal distribution. What power do you get?

This is the two-sample version of power.t.test. The things to set up are the delta, the true difference between the means, the sample size in each group (these must be the same), the SD in each group (these must also be the same), the type of test (two-sample), and the one-sidedness of the alternative:

power.t.test(delta = 45 - 40, n = 30, sd = 10, type = "two.sample",
                 alternative = "one.sided")

     Two-sample t test power calculation 

              n = 30
          delta = 5
             sd = 10
      sig.level = 0.05
          power = 0.6060253
    alternative = one.sided

NOTE: n is number in *each* group

The power of the test is 0.606. Make sure you state the answer rather than making your reader find it in the output. Three points for getting the code right, and one for stating the answer.

Most of the time (a little over 60% of the time, in fact), you will successfully reject the null hypothesis.

This power calculation uses the pooled test under the hood (hence the requirement for the two groups to have the same SD), because the mathematics works for the pooled test but not the Welch test.

Extra: we can (with considerable difficulty) simulate this result. There are different ways, but my take is to set up the group information first:

stats <- tribble(
  ~group, ~n, ~mu, ~sigma,
  "treatment", 30, 45, 10,
  "control",   30, 40, 10
)
stats

Then do this, some of which will look familiar and some of which won’t:

stats %>% mutate(sim = list(1:1000)) %>% 
  unnest_longer(sim) %>% 
  rowwise() %>% 
  mutate(my_sample = list(rnorm(n, mu, sigma))) %>% 
  unnest_longer(my_sample) %>% 
  nest_by(sim) %>% 
  mutate(t_test = list(t.test(my_sample ~ group, 
                              data = data, 
                              alternative = "less",
                              var.equal = TRUE))) %>% 
  mutate(p_value = t_test$p.value) %>% 
  ungroup() %>% 
  count(p_value <= 0.05)

The power we get here is a little less than the exact calculation. But, the same rule of thumb applies as before: we are doing 1000 simulations, each of which can either reject or not, and so we’d expect to be (and are) within 3 percentage points of the truth.

A more detailed explanation of this code is coming at the end, so as not to clutter up these solutions too much.

  2. (2 points) You and the other people running this experiment agree that the assumption of normal distributions is reasonable. What are two other situations that would prevent you from using power.t.test to obtain the power in an experiment like this one?

Think about the inputs to power.t.test: the sample size and the SD. These apply to both groups, and so have to be the same for both. Hence, if either of these is not the case, you cannot get the answer from power.t.test; specifically:

  • you have different sample sizes in the two groups
  • the two groups have different standard deviations.

The simulation approach I showed you the code for in an Extra to the previous part has no such limitations, so you could adapt that for either of these situations (see the sketch after the list below), but power.t.test is not that flexible. The reasoning is:

  • it is customary to use the same sample size in the two groups of a comparative study (if the SDs are about equal, this is also the best allocation of experimental effort, in that equal sample sizes give you the greatest power).
  • the theory behind power.t.test is based on the pooled test, which assumes that the two groups have the same SD. There is no corresponding theory (to my knowledge) behind the Welch test.
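
To make that concrete, here is a sketch of how the setup of that simulation might change if the two groups had unequal sample sizes and unequal SDs (the numbers are invented purely for illustration):

# same layout as the stats tribble from before, but the groups now differ
stats_unequal <- tribble(
  ~group, ~n, ~mu, ~sigma,
  "treatment", 40, 45, 12,
  "control",   20, 40,  8
)

The rest of the simulation pipeline would stay the same, except that with unequal SDs you would leave out var.equal = TRUE so that t.test runs the Welch test rather than the pooled one.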

  3. (3 points) What sample sizes will get you a power of 0.75 under the assumptions given in the question? State your answer clearly.

Code-wise, take your power.t.test code from above, remove the n =, and replace it with your target power:

power.t.test(delta = 45 - 40, power = 0.75, sd = 10, type = "two.sample",
                 alternative = "one.sided")

     Two-sample t test power calculation 

              n = 43.7269
          delta = 5
             sd = 10
      sig.level = 0.05
          power = 0.75
    alternative = one.sided

NOTE: n is number in *each* group

Thus, you will need 44 observations in each group (that is, 44 in the treatment group and another 44 in the control group, for a total of 88 observations). Two points for getting the code right and one for stating the answer properly so that your colleagues who are actually going to run the experiment will know what to do.

Extra 1: Earlier, we had a sample size of 30 in each group and got power around 0.60. Our target power was somewhat more than this, and so our required sample sizes are now somewhat greater than 30. (The practical issue arising from this is now whether the budget for the experiment will extend to cover the cost of sampling 88 individuals altogether.)

Extra 2:

The simulation I did needs some explanation. Let me take you through it gently (but probably at great length).

The place I started was to make a little dataframe with information about the populations and sample sizes. I like tribble for this, because you lay out the data for the dataframe the same way you think of it (observations in rows, variables in columns):

stats <- tribble(
  ~group, ~n, ~mu, ~sigma,
  "treatment", 30, 45, 10,
  "control",   30, 40, 10
)
stats

The first row, the one with the squiggles (tildes), supplies the column names.

Where I am going here is to see if I can do one simulation and then later scale it up.

Next, generate a random sample from each population. The way I set things up, the columns I called n, mu, and sigma go into rnorm in exactly that order:

stats %>% 
  rowwise() %>% 
  mutate(my_sample = list(rnorm(n, mu, sigma)))

This should be looking familiar: each cell in my_sample holds all of a sample from the group in that row.

Usually, when we are doing these, we summarize the samples in some way: by taking their means if we are doing a bootstrap sampling distribution, or by running some kind of test if we are doing a power analysis. The problem here is that we need both samples together, and for that we need the actual values. To get those out (somebody asked about this in lecture), there is a family of functions that start with unnest. The one we want is unnest_longer, because we want to expand the columns down the page (unnest_wider expands them across the page, making a lot of new columns). The unnest functions accept the column you want to “unnest” as input, which will be a list-column:

stats %>% 
  rowwise() %>% 
  mutate(my_sample = list(rnorm(n, mu, sigma))) %>% 
  unnest_longer(my_sample) -> d
d

There are 60 rows altogether: the first 30 are labelled treatment and contain the simulated treatment group scores, and the last 30 contain the simulated control group scores, for one simulation.

I saved this dataframe, stopping the pipeline here, for reasons that you’ll see in a moment. The thing to notice about d is that it is now exactly the right layout for a two-sample \(t\)-test: a column of measurements in my_sample,3 and a column of group names, which got repeated to match the 30 observations in their group.
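
If you want to check the layout for yourself (a sanity check, not part of the required work), a quick count shows 30 rows for each group:

# ungroup first, in case the rowwise from earlier in the pipeline is still attached
d %>% ungroup() %>% count(group)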

So, a two-sample \(t\). I actually set things up so that the “true” SDs of each group are identical, so the right test to use is the pooled one:4

t.test(my_sample ~ group, data = d, alternative = "less",
       var.equal = TRUE)

    Two Sample t-test

data:  my_sample by group
t = -2.1479, df = 58, p-value = 0.01795
alternative hypothesis: true difference in means between group control and group treatment is less than 0
95 percent confidence interval:
      -Inf -1.256144
sample estimates:
  mean in group control mean in group treatment 
               39.32497                44.98891 

Two short asides here:

First, to get the P-value in a format you can use, save the t.test output in something, then access it as p.value:

ans <- t.test(my_sample ~ group, data = d, alternative = "less",
       var.equal = TRUE)
ans$p.value
[1] 0.01795493

(I think we saw this in lecture.)

Second, you can actually continue the pipeline, but the dataframe output that I saved in d is not the first input to t.test, so some care is needed:

stats %>% 
  rowwise() %>% 
  mutate(my_sample = list(rnorm(n, mu, sigma))) %>% 
  unnest_longer(my_sample) %>% 
  t.test(my_sample ~ group, data = ., alternative = "less",
       var.equal = TRUE)

    Two Sample t-test

data:  my_sample by group
t = -2.1376, df = 58, p-value = 0.01839
alternative hypothesis: true difference in means between group control and group treatment is less than 0
95 percent confidence interval:
      -Inf -1.228559
sample estimates:
  mean in group control mean in group treatment 
               39.39336                45.02800 

The results here are different only because I drew some new random samples.5 The problem is that we no longer have a name for the dataframe coming out of the unnest_longer, but we need one for the data = in t.test, so what we do is to use the special name ., meaning “whatever came out of the previous step”.

All right, I’m done with the asides.6 What we have seen so far is how to generate one random sample from each group, run the \(t\)-test, and get hold of the P-value. Now we need to scale this up so that we can draw 1000 random samples from each group and run a two-sample (pooled) \(t\)-test on each one. It is, unfortunately, not completely obvious how to scale it up, but I will attempt to explain what I ended up doing.7

The usual way we begin a simulation is with something like

tibble(sim = 1:1000)

except that this time I need to combine this with the dataframe stats that I made earlier. I came up with this:

stats %>% mutate(sim = list(1:1000)) 

so that you see I have made a place for 1000 simulated samples from each group, or at least I will have done once I unnest_longer:

stats %>% mutate(sim = list(1:1000)) %>% 
  unnest_longer(sim) 

2000 rows this time, 1000 for each of the two groups.

Now I generate some random samples, via a familiar mechanism:

stats %>% mutate(sim = list(1:1000)) %>% 
  unnest_longer(sim) %>% 
  rowwise() %>% 
  mutate(my_sample = list(rnorm(n, mu, sigma))) 

Note that I am doing the same thing as I did before to generate samples with the right mean (and the right SD and sample size, if they had happened to be different).

At this point, we actually need to get hold of the sample values again, via another unnest_longer:

stats %>% mutate(sim = list(1:1000)) %>% 
  unnest_longer(sim) %>% 
  rowwise() %>% 
  mutate(my_sample = list(rnorm(n, mu, sigma))) %>% 
  unnest_longer(my_sample) 

We have an absurd number of rows,8 the 60 rows of the two randomly-generated samples times 1000 simulations, but we do at least have everything we need now. The problem is that it’s not organized right.

The next step is going to look a bit like black magic. One of the things about becoming an experienced data analyst is that you recognize certain situations coming up again and again, so I knew what to do about this one. You can get the same intuitions if you practice enough and pay attention to what you are seeing over a period of time, but the key ingredient is practice. This is why I give you worksheets and assignments, so that you can get some of that practice and learn some of the thinking skills that you will need later.

All right:

stats %>% mutate(sim = list(1:1000)) %>% 
  unnest_longer(sim) %>% 
  rowwise() %>% 
  mutate(my_sample = list(rnorm(n, mu, sigma))) %>% 
  unnest_longer(my_sample) %>% 
  nest_by(sim) 

What on earth did that do, and where did that thing called data come from?

Where I wanted to get was to have a dataframe containing both treatment and control within each simulation, because I know how to deal with that (this is what we started this lengthy Extra with). Nesting is the opposite of unnesting: unnesting expands a list-column so that you can see all the values, and nesting compresses a column of numbers into a list-column. This particular variant creates a list-column of dataframes (called data), one of them for each thing you nest by (which was simulation in this case). So each thing in data is actually a dataframe compressed into one cell, and in that dataframe are all the things we had just now except for the thing we nested-by (sim).

The notation tibble[,5] means “a dataframe with five columns.” The dataframe we had before the nest_by had six columns: one of them was sim, which got split out, and the other five columns are in data. In particular, each dataframe in data has a group containing treatment and control, and a my_sample containing all the simulated values. Each dataframe in data has \(30+30 = 60\) rows, and there are 1000 simulations, so the “outer” dataframe we are looking at has 1000 rows. Dataframes within dataframes. Spooky.
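
If you would like to look inside one of those inner dataframes for yourself (purely out of interest), you can save the nested dataframe in something (I am calling it nested here) and pull out, say, the first element of data:

nested <- stats %>% mutate(sim = list(1:1000)) %>% 
  unnest_longer(sim) %>% 
  rowwise() %>% 
  mutate(my_sample = list(rnorm(n, mu, sigma))) %>% 
  unnest_longer(my_sample) %>% 
  nest_by(sim)
# the first of the 1000 nested dataframes: 60 rows of group and my_sample (plus the other columns)
nested$data[[1]]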

Unnecessary aside: when I was very young, there was a popular song covered by a lot of singers9 called Windmills Of Your Mind. It goes like this:

Like a circle in a spiral, like a wheel within a wheel
Never ending or beginning on an ever-spinning reel
Like a snowball down a mountain or a carnival balloon
Like a carousel that’s turning, running rings around the moon
Like a clock whose hands are sweeping past the minutes of its face
And the world is like an apple whirling silently in space
Like the circles that you find, in the windmills of your mind.

The people who wrote these lyrics said, according to Wikipedia, “The lyric we wrote was stream-of-consciousness. We felt that the song had to be a mind trip of some kind.”

Anyway.

Once we have made it this far, the rest of the way is using ideas we already saw:

stats %>% mutate(sim = list(1:1000)) %>% 
  unnest_longer(sim) %>% 
  rowwise() %>% 
  mutate(my_sample = list(rnorm(n, mu, sigma))) %>% 
  unnest_longer(my_sample) %>% 
  nest_by(sim) %>% 
  mutate(t_test = list(t.test(my_sample ~ group, data = data,
                              alternative = "less",
                              var.equal = TRUE))) %>% 
  mutate(p_value = t_test$p.value) %>% 
  ungroup() %>% 
  count(p_value <= 0.05)

Run a pooled \(t\)-test on the dataframe in (each row of) the column data (playing the rôle of my d earlier), pull out the P-value of each one, and then count how many of them are less than 0.05 and would correctly lead to a rejection. The only new thing here is the ungroup. nest_by comes with a rowwise attached, which is what we want until right at the end; if we are not careful, we will get a count of the number of P-values less than 0.05 on each row (which can only ever be 0 or 1), so what we need to do is to undo the (implicit) rowwise, which is what ungroup does. ungroup also undoes group_by (hence the name), so you can also use it for that if you have done a group_by and summarize, and you then don’t need the group_by any more.
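
As a small illustration of that last point, here is a tiny made-up dataframe: group_by leaves it grouped (I use mutate here rather than summarize, because summarize with a single grouping variable drops the grouping by itself), and ungroup takes the grouping off again before you carry on:

# a tiny made-up example, just to show group_by and ungroup
small <- tibble(g = c("a", "a", "b", "b"), y = c(1, 2, 3, 4))
small %>% 
  group_by(g) %>% 
  mutate(group_mean = mean(y)) %>%   # still grouped by g after this
  ungroup()                          # grouping removed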

My simulated power here is a bit different from my simulated power before, but they are both within chance of the answer from power.t.test.

Footnotes

  1. The power.t.test function comes from the early days of R, and there was very much a tradition in the early code that it should do “what the user is likely to want”. If you think of delta as true minus null, then a one-sided alternative will give you the correct one-sided power whether your truth is greater or less than the null, under the reasonable assumption that the one-sided alternative you want is on the same side as your truth relative to the null.↩︎

  2. One die, two dice.↩︎

  3. Which was probably not the best name, but so it is.↩︎

  4. I wanted to give you one you could do with power.t.test, which uses a pooled test behind the scenes.↩︎

  5. I suppose I could have avoided that by some careful setting and resetting of the random number seed.↩︎

  6. These ones, anyway. There is another totally unnecessary one later.↩︎

  7. I was sitting in the Civic Centre library doing this, during one of my work-from-home days.↩︎

  8. The pager will only show us 10,000 rows, but there are actually 60,000 there.↩︎

  9. One of whom was Dusty Springfield, who was most famous for a song called I Only Want To Be With You. That song was rather improbably covered by a Danish heavy metal band called Volbeat. There is a video on YouTube of them performing it live, a video that contains every heavy-metal trope you can think of, up to and including crowd-surfing.↩︎