Worksheet 5

Published

October 8, 2025

Questions are below. My solutions will be available after the tutorials are all finished. The whole point of these worksheets is for you to use your lecture notes to figure out what to do. In tutorial, the TAs are available to guide you if you get stuck. Once you have figured out how to do this worksheet, you will be prepared to tackle the assignment that depends on it.

If you are not able to finish in an hour, I encourage you to come back later and complete what you were unable to get through in tutorial.

Packages you’ll need:

library(tidyverse)

Power and the exponential distribution

The exponential distribution is often used to model lifetimes of things like light bulbs or electronic components. It is not normal (bell-curved) in shape. In R, the function rexp will draw a random sample from an exponential distribution. It has two inputs: the sample size and the “rate” (which is one over the mean).
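For example, this (just an illustration; the numbers are chosen to match the question below) would draw five random values from an exponential distribution with mean 5:

rexp(5, 1/5)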

  1. Draw a random sample of 100 observations from an exponential distribution with mean 5, saved as one column of a dataframe, and make a histogram of your sample. In what way is the distribution not normal in shape? Hint 1: tibble with the same sort of code as in a mutate will create a dataframe from scratch. Hint 2: use a set.seed first so that your random sample won’t change from one run to the next. You can use any number as input to the set.seed.

This is my set.seed:

set.seed(457299)

You should probably get into the habit of using your own random number seed.

Use tibble and the definition to create a dataframe. The second input is \(1/5 = 0.2\); it is easiest to let R do the calculation for you:

expo <- tibble(x = rexp(100, 1/5))

We don’t know what the application is, so you can use “indefinite” names like mine.

A histogram via code such as this:

ggplot(expo, aes(x = x)) + geom_histogram(bins = 8)

This number of bins shows the shape well for me. Yours will be different, so you might need a different number of bins.

This fails to be normally-distributed because it is skewed to the right. (Yours should have the same shape even if it looks a bit different. The sample size is big, so your sample should be a reasonable representation of the actual distribution, which has this shape.)

  1. For your sample, obtain a bootstrap sampling distribution of the sample mean. Comment briefly on its shape.

This is the same procedure as in lecture. Make a new dataframe to hold the simulations, and then sample from your one column in your dataframe (whatever you called the column and the dataframe). Mine, therefore, goes like this:

tibble(sim = 1:10000) %>%
  rowwise() %>% 
  mutate(my_sample = list(sample(expo$x, replace = TRUE))) %>% 
  mutate(my_mean = mean(my_sample)) %>% 
  ggplot(aes(sample = my_mean)) + stat_qq() + stat_qq_line()

I used 10,000 simulations because I am drawing a normal quantile plot at the end, and if you use fewer simulations, you can be deceived by what happens in the tails. You could do fewer simulations and then draw a histogram, but then you have the more difficult job of assessing the histogram for normality (“is it bell-shaped?”). If you draw a normal quantile plot, all you have to do is compare the points with the line.

Mine looks very slightly right-skewed (it has a slight but definite curve), but it is close to normal (close to the line). Yours will be different because you’ll have a different random sample, but I’m expecting that you will see something very similar.
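If you did go the histogram route mentioned above (fewer simulations, then a histogram of the simulated means), a sketch of what that might look like, keeping the column names from the code above and guessing at a sensible number of bins:

tibble(sim = 1:1000) %>% 
  rowwise() %>% 
  mutate(my_sample = list(sample(expo$x, replace = TRUE))) %>% 
  mutate(my_mean = mean(my_sample)) %>% 
  ggplot(aes(x = my_mean)) + geom_histogram(bins = 12)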

Extra: these questions were from an old assignment, and at that time I didn’t show you the normal quantile plot until later in the course. Had I not asked you to draw a histogram, you could reasonably also have drawn a normal quantile plot of the original data, which looks like this (for me):

ggplot(expo, aes(sample = x)) + stat_qq() + stat_qq_line()

This perhaps makes the point better that the original data were very right-skewed (a big-time curve), but the bootstrap sampling distribution of the sample mean is much closer to normal, because we have a large sample (\(n = 100\)) and therefore the Central Limit Theorem is helping a lot. You might say that a sample of size 100 is big enough, or almost big enough, to overcome even that considerable right-skewness.

  1. Explain briefly why it makes sense to use a (one-sample) \(t\)-test for the population mean, even though the population does not appear to have a normal distribution.

The sampling distribution of the sample mean is normal (or close to normal) even though the distribution of the sample is very much not: the sample size is (almost) large enough to overcome the non-normality in the sample. (“The sample is large” is not a complete answer, because it doesn’t say how the large sample size helps.)

If your simulated sampling distribution is skewed right too much for your tastes, you’ll have to use words like “close to normal” to justify using a \(t\)-test here (which would imply that the P-value is close to correct even if it is not bang on).

  1. Estimate by simulation the power of a one-sample \(t\)-test to reject the null hypothesis that the population mean is 6 (against a two-sided alternative), when sampling from an exponential distribution with mean 5 and using a sample of size 100.

This one has the same ingredients as the one in lecture, but is not exactly the same, because we are not sampling from a normal distribution here:

tibble(sim = 1:1000) %>% 
  rowwise() %>%
  mutate(my_sample = list(rexp(100, 1/5))) %>% 
  mutate(t_test = list(t.test(my_sample, mu = 6))) %>% 
  mutate(p_value = t_test$p.value) %>% 
  count(p_value <= 0.05)

The steps:

  • set up a dataframe with places for 1000 simulations (or more)
  • work rowwise
  • generate random samples from the true distribution, exponential with the true mean (of 5). This is the same code (inside the list) as you wrote earlier.
  • run a \(t\)-test on each of your simulated samples, testing that the mean is 6 (the null mean, which we know to be wrong). The output from t.test is a collection of things (the P-value, the sample mean, a confidence interval, etc.) so this needs to be wrapped in list (and will make a list-column).
  • pull out the P-value from each \(t\)-test, which is a single number each time, so no list needed
  • count how many of those P-values would lead to (correct) rejection of the (incorrect) null.

My estimated power is 0.541. I would expect yours to be somewhere within about 0.030 of that, with 1000 simulations. (If it is not, give your code a good look over, and make sure it is correct.)
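If you prefer the power to come out directly as a proportion rather than a count, one possible variation (a sketch only: the count line is replaced by an ungroup and a summarize, so that the number of rejections is divided by the number of simulations automatically) is:

tibble(sim = 1:1000) %>% 
  rowwise() %>%
  mutate(my_sample = list(rexp(100, 1/5))) %>% 
  mutate(t_test = list(t.test(my_sample, mu = 6))) %>% 
  mutate(p_value = t_test$p.value) %>% 
  ungroup() %>% 
  summarize(power = mean(p_value <= 0.05))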

Extra: Where did the 6 come from? In this case, out of my head,1 but in general, this is the “subject matter knowledge” thing: a subject matter expert tells us what null hypothesis would usually be tested in this situation (the 6) and what sort of difference from that would be considered a scientifically interesting departure (the 5, 1 away).

I wanted to lead you through the process, so I started you off by generating a random sample from what was going to be the truth and introduced the null mean of 6 later, which is perhaps backwards. The logical way to think about power is that the null hypothesis represents the “standard theory” (6 here) and we are trying to see how likely we are to reject that null hypothesis when the truth is actually something else far enough from that null hypothesis to be of interest (5 here).

  1. Suppose our aim is to estimate (by simulation) the sample size needed to get a power of 0.75 in this same situation. By copying and pasting your code and making a small edit to it, run another simulation that will go some way towards meeting that aim. (This was previously an assignment question, so I made it definite about what to do, but on a worksheet, feel free to experiment further after you have tried your one more simulation.)

In my case (and probably yours), the power came out less than 0.75, so to achieve this much power, we will need a bigger sample size. The problem with the simulation approach is that we have no idea how much bigger the sample size needs to be, so to answer this question, copy and paste your code and change the sample size in the rexp line to any number at all bigger than 100, for example:

tibble(sim = 1:1000) %>% 
  rowwise() %>%
  mutate(my_sample = list(rexp(200, 1/5))) %>% 
  mutate(t_test = list(t.test(my_sample, mu = 6))) %>% 
  mutate(p_value = t_test$p.value) %>% 
  count(p_value <= 0.05)

This will get your power closer to 0.75. It doesn’t have to be much closer; any closer at all will do. I actually overshot some, so my next guess (if I were to make one) would be a sample size less than 200, but closer to 200 than 100 (say, something like 180). There is randomness here, so I can only hope to get so close. I actually was lucky enough to do pretty well.

You only needed one more simulation of this kind. You can do more if you want (but see the Extra below).

If your first simulation for some reason gave you a power greater than 0.75, your second guess at the sample size should be something less than 100 (again, it doesn’t matter exactly what).
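If you do decide to experiment with more sample sizes, one way to avoid repeated copying and pasting (a sketch only; the function name estimate_power is my own choice, and nothing here is required for the worksheet) is to wrap the simulation in a function of the sample size:

# a sketch: the power simulation as a function of the sample size n
estimate_power <- function(n, nsim = 1000) {
  tibble(sim = 1:nsim) %>% 
    rowwise() %>% 
    mutate(my_sample = list(rexp(n, 1/5))) %>% 
    mutate(t_test = list(t.test(my_sample, mu = 6))) %>% 
    mutate(p_value = t_test$p.value) %>% 
    ungroup() %>% 
    summarize(power = mean(p_value <= 0.05)) %>% 
    pull(power)
}
estimate_power(150)  # for example, try a sample size of 150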

Extra: I said we “can only get so close”. To think about how close we might be to the “true power” (whatever it is), note that each simulated sample either leads to us correctly rejecting the null (success) or failing to reject the null and making a type II error (failure). Then we have 1000 of those repeated trials, and we count how many times we succeeded. My language here (“trial” with “success” and “failure”) is intended to remind you of a binomial distribution. Here we have \(n = 1000\) trials (simulated samples), and a probability \(p\) of correctly rejecting the null (that is the power we are trying to estimate). You might have run into a confidence interval formula for \(p\), based on the number of trials and successes observed; this is encoded in the R function prop.test, which takes (for a confidence interval) two inputs, the number of successes and the number of trials.

For my first simulation, I got 541 successful rejections, so a 95% confidence interval for the true power is

prop.test(541, 1000)

    1-sample proportions test with continuity correction

data:  541 out of 1000, null probability 0.5
X-squared = 6.561, df = 1, p-value = 0.01042
alternative hypothesis: true p is not equal to 0.5
95 percent confidence interval:
 0.5095159 0.5721653
sample estimates:
    p 
0.541 

From this simulation, the power is most likely between 0.510 and 0.572. (This is where I got my “within 0.030” above from.)

For my second simulation, I got 781 successful rejections out of 1000:

prop.test(781, 1000)

    1-sample proportions test with continuity correction

data:  781 out of 1000, null probability 0.5
X-squared = 314.72, df = 1, p-value < 2.2e-16
alternative hypothesis: true p is not equal to 0.5
95 percent confidence interval:
 0.7537994 0.8060081
sample estimates:
    p 
0.781 

The confidence interval is between 0.754 and 0.806. We were aiming for a power of 0.75, so now we know that a sample size of 200 is indeed too big, but it may not be much too big.

To get this confidence interval shorter, we need to do more simulations, like 10,000. Let me have a third go, using a sample size of 180 and 10,000 simulations:

tibble(sim = 1:10000) %>% 
  rowwise() %>%
  mutate(my_sample = list(rexp(180, 1/5))) %>% 
  mutate(t_test = list(t.test(my_sample, mu = 6))) %>% 
  mutate(p_value = t_test$p.value) %>% 
  count(p_value <= 0.05)

That comes out really close to 0.75 (0.7437 for me). We would expect the confidence interval for the true power (i) to be shorter (more simulations), and (ii) to contain 0.75:

prop.test(7437, 10000)

    1-sample proportions test with continuity correction

data:  7437 out of 10000, null probability 0.5
X-squared = 2374.6, df = 1, p-value < 2.2e-16
alternative hypothesis: true p is not equal to 0.5
95 percent confidence interval:
 0.7350000 0.7522117
sample estimates:
     p 
0.7437 

Between 0.735 and 0.752. We have now nailed down the power rather precisely, and we are not likely to get any closer than \(n = 180\) as a sample size to achieve this power.

Child psychology

According to research in child psychology, working mothers spend a mean time of 11 minutes per day talking to their children, with a standard deviation of 2.3 minutes. Your research suggests that the mean should be greater than that, and you are planning a study of working mothers (who work outside the home) to see how many minutes they spend talking to their children. You think that the mean should be 12 minutes per day, and you want to design your study so that a mean of 11 should be rejected with a reasonably high probability.

  1. If you interview 20 working mothers, what is the power of your test if your thought about the mean is correct? Estimate by simulation. Assume that the time that a mother spends talking to her children has a normal distribution.

The true mean (as far as you are concerned) is 12, so sample from a normal distribution with that mean (and the same standard deviation as the other research, since you don’t have a better value).2 The null mean is the one from the previous research, namely 11 (the one you want to reject), and because you want to prove that the mean is in fact greater than 11, you need an upper-tailed alternative.

Then follow the familiar six lines of code:

tibble(sim = 1:1000) %>% 
  rowwise() %>% 
  mutate(my_sample = list(rnorm(20, 12, 2.3))) %>% 
  mutate(t_test = list(t.test(my_sample, mu = 11, alternative = "greater"))) %>% 
  mutate(p_value = t_test$p.value) %>% 
  count(p_value <= 0.05)

The estimated power is \(567/1000 = 0.567\). By interviewing 20 working mothers, you have a better-than-even chance, but far from a guarantee, of rejecting the null hypothesis that the mean time spent talking to their children is 11 minutes per day, in favour of the alternative that the mean is larger than this.

Your answer will likely differ from mine by a bit (see below for how big “a bit” might be), but it should be somewhere near this.

The code:

  • set up a data frame with places for 1000 (or 10,000 or whatever) simulated samples
  • work one row at a time from here forward
  • draw a random sample of size 20 from the truth (as far as you know what it is) for each row
  • run a \(t\)-test testing whether the mean is 11 for each row (which you know, or at least think, it isn’t, but the onus is on your research to prove itself over the current state of affairs)
  • get the P-value from each test
  • count how many of those P-values are less than or equal to 0.05. The number of times this is true, divided by your number of simulations, is your estimate of the power.

Extra: you can get a sense of how far off the simulation might have been by counting rejection as a “success” and non-rejection as a “failure”; my simulation gave 567 successes in 1000 trials, and a 95% confidence interval for the probability of success comes from a formula you might have run into, based on the normal approximation to the binomial, or this:

prop.test(567, 1000)

    1-sample proportions test with continuity correction

data:  567 out of 1000, null probability 0.5
X-squared = 17.689, df = 1, p-value = 2.601e-05
alternative hypothesis: true p is not equal to 0.5
95 percent confidence interval:
 0.5355888 0.5978900
sample estimates:
    p 
0.567 

A 95% CI for the true power of the test is from 0.536 to 0.598.

With 1000 simulations, you’ll see that the confidence interval goes about 3 percentage points up and down from the estimate. A rule of thumb is that confidence intervals for proportions based on samples of size 1000 will have a margin of error of about 3 percentage points (as long as the true proportion is somewhere near 0.5). Check this on the next opinion poll result you see: usually it will say something like “accurate to within 3 percentage points 19 times out of 20”, which is their calculated margin of error for a 95% CI. If your estimated-by-simulation power comes out between about 0.56 and about 0.62, that is as expected. If it doesn’t, either you were unlucky or you have a coding error (such as forgetting to make the test one-sided).
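Where does that 3 percentage points come from? A quick check of the rule of thumb, using the usual normal-approximation margin of error for a proportion near 0.5 with 1000 trials:

1.96 * sqrt(0.5 * 0.5 / 1000)

[1] 0.03099032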

If you did 10,000 simulated \(t\)-tests instead of my 1000, you will have a shorter CI, for example:

tibble(sim = 1:10000) %>% 
  rowwise() %>% 
  mutate(my_sample = list(rnorm(20, 12, 2.3))) %>% 
  mutate(t_test = list(t.test(my_sample, mu = 11, alternative = "greater"))) %>% 
  mutate(p_value = t_test$p.value) %>% 
  count(p_value <= 0.05)

and

prop.test(5835, 10000)

    1-sample proportions test with continuity correction

data:  5835 out of 10000, null probability 0.5
X-squared = 278.56, df = 1, p-value < 2.2e-16
alternative hypothesis: true p is not equal to 0.5
95 percent confidence interval:
 0.5737574 0.5931781
sample estimates:
     p 
0.5835 

This goes to show that the extra time you spend waiting for 10,000 simulations pays off in the accuracy of your estimated power. We can now be pretty sure that the power is not as big as 0.6.

  1. Explain briefly why power.t.test can be used to calculate an answer to this problem, and use it to check your result in the previous part.

We can use power.t.test for two reasons: (i) we actually are doing a \(t\)-test (for a population mean), and (ii) we are assuming that the times spent talking to children have a normal distribution.

For power.t.test we need to specify the sample size n, the difference between the true and null means delta, the population SD sd, the type of test (a one-sample \(t\)), and whether it is one- or two-sided:

power.t.test(n = 20, delta = 12 - 11, sd = 2.3, type = "one.sample", alternative = "one.sided")

     One-sample t test power calculation 

              n = 20
          delta = 1
             sd = 2.3
      sig.level = 0.05
          power = 0.5908187
    alternative = one.sided

If you are like me, you probably typed alternative = "greater", as you would with t.test, but then the error message will tell you what you actually need, either one.sided or two.sided. You need to be aware that information about the one-sidedness of your test has to find its way into power.t.test somehow; it turns out that it doesn’t matter which kind of one-sided test you have (greater or less), just that you have a one-sided test at all.3

The power here is actually 0.590. My simulated value of 0.567 was close, as in within three percentage points. Your simulated value in the previous question will probably be different from mine, but it ought to be somewhere near to 0.590. (See the Extra in the previous part for how close you would expect to be.)

\(\blacksquare\)

  1. A standard level of power in studies in child psychology is 80% (0.80). According to what you have seen so far, is it necessary to interview 20 working mothers, or more, or fewer? Explain briefly. Use power.t.test to obtain an appropriate sample size, under the assumptions you have made.

Solution

Our power is smaller than the target power of 0.8, so our sample size should be bigger than 20. (A bigger sample size will give a bigger power.)

Use power.t.test, inserting a value for power, and leaving out n (since that is what we want the function to find for us):

power.t.test(power = 0.80, delta = 12 - 11, sd = 2.3, type = "one.sample", alternative = "one.sided")

     One-sample t test power calculation 

              n = 34.10007
          delta = 1
             sd = 2.3
      sig.level = 0.05
          power = 0.8
    alternative = one.sided

Round the sample size up to 35. We need to interview 35 working mothers to get 80% power.
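If you want to double-check the rounding, you can (optionally) feed the rounded-up sample size back into power.t.test and confirm that the power is now a little above 0.80:

power.t.test(n = 35, delta = 12 - 11, sd = 2.3, type = "one.sample", alternative = "one.sided")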

Extra: Psychologists (and other people) like to quantify things in terms of “effect size”, which is a difference in means divided by a standard deviation (like a \(z\)-score but applied to means), which some people know of as “Cohen’s \(d\)”:

d <- (12 - 11)/ 2.3
d
[1] 0.4347826

According to the guidelines usually quoted (originally due to Cohen), an effect size of 0.2 is considered “small”, 0.5 “moderate” and 0.8 “large”, so what we have here is a “moderate effect”, one that needs a moderate sample size to provide convincing evidence that it is real.4
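If you like working with the effect size directly, the same sample-size calculation can be phrased in terms of d (a sketch; since only the ratio of delta to sd matters in power.t.test, using delta = d with sd = 1 should reproduce the \(n\) of about 34.1 from above):

power.t.test(power = 0.80, delta = d, sd = 1, type = "one.sample", alternative = "one.sided")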

Child psychology, revisited

This is a continuation of an earlier question about talking to children.

  1. Another distribution that might be suitable for time spent talking to children is the gamma distribution. Values from a gamma distribution are guaranteed to be greater than zero (which is suitable for times spent talking to children). As far as R is concerned, a random value from a gamma distribution is generated using the function rgamma. This, for us, has three inputs: the number of values to generate, a parameter called shape for which we will use the value 27.23, and a parameter called scale for which we will use the value 0.44. Generate a random sample of 1000 values from a gamma distribution with the given parameter values. Hint: make sure that the inputs that need names actually have names, and organize your results as a column in a dataframe.

Solution

Following the hint, something like this:

d <- tibble(g = rgamma(1000, shape = 27.23, scale = 0.44))
d

Displaying the dataframe as usual shows the first 10 rows, which once again will enable anyone reading your work to see that you have the right kind of thing.

I am apparently using my standard “temporary dataframe name” of d. Give it and the column in it whatever names you think make sense. For example, you might use gamma for the column and gammas (plural) for the dataframe.

\(\blacksquare\)

  1. Find the mean and SD of your random sample of values from the gamma distribution. Are the mean and SD somewhere close to the mean and SD you used in your first power analysis? Explain (very) briefly.

Solution

The first thing is the mean and SD of your random values. Starting from a dataframe, this is summarize:

d %>% summarize(g_mean = mean(g), g_sd = sd(g))

The mean and SD we used in the first power simulation of the Child psychology question were 12 and 2.3 respectively, and these (for me) are very close to that. (Calculating the mean and SD and saying just “these are close” will confuse your reader, so make sure you say what they are supposed to be close to.)

Extra: I actually defined the shape and scale for your gamma distribution to be the ones that would give you the right mean and SD on average, and so the only reason they didn’t come out exactly the same is randomness, in your drawing of 1000 random values. Asking you to obtain 1000 random values (rather than fewer) should mean that your mean and SD come out close to the right thing with high probability. More values will also give you a better picture (next part).

If you look in the help for rgamma, by typing ?rgamma in the console (the help comes out bottom right), you’ll see (hiding in the Details) that the mean of a gamma distribution is shape times scale, and the variance (SD squared) is shape times scale-squared. This means that if you know shape and scale, you also know what the mean and SD of the distribution are. But our problem is the opposite way around: we know what mean and SD we want, and we want to find the shape and scale that produce them. This actually looks as if you can solve it using algebra, but I am too lazy to do algebra tonight, so I want to show you how I actually did it, which is better than the trial and error you might otherwise have to resort to.
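(In case you want the algebra anyway: writing \(\mu\) for the desired mean and \(\sigma\) for the desired SD, we have \(\mu = \text{shape} \times \text{scale}\) and \(\sigma^2 = \text{shape} \times \text{scale}^2\), so \(\text{scale} = \sigma^2/\mu\) and \(\text{shape} = \mu^2/\sigma^2\). With \(\mu = 12\) and \(\sigma = 2.3\), that gives a shape of about 27.22 and a scale of about 0.441, essentially the 27.23 and 0.44 used above; the tiny difference is just optim, below, stopping once it gets within its tolerance.)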

The starting point is a function, rather unimaginatively called f, that takes as input a shape and scale, gets hold of the mean and SD that go with that shape and scale, and then sees how close we are to the 12 and 2.3 we were aiming for. It looks like this:

f <- function(x) {
  shape <- x[1]
  scale <- x[2]
  mean <- shape * scale
  variance <- shape * scale^2
  sd <- sqrt(variance)
  (mean - 12)^2 + (sd - 2.3)^2
}

The input is actually a vector called x, which contains the shape as its first element and the scale as its second. If you have not seen an R function up close before, this is what one looks like. On the top line is the name of the function (f), the word function, and its input (here just x, which, as noted, contains two values: the shape and then the scale). Then, inside curly brackets, comes the actual business of the function. If you have written a function in Python before, you might remember that Python functions start with def and use indentation rather than curly brackets. They also use return to mark the end result of the calculation that is sent back to the outside world. R is a bit different; the last line of a function is typically a calculation, and the result of that calculation is what is sent back to the outside world.

To take you through my function:

  • first pull the shape and scale out of the input x (so that I remember which is which)
  • work out the mean corresponding to the input shape and scale
  • work out the variance ditto
  • work out the SD from the variance (by taking its square root)
  • calculate something that will be zero if the mean is exactly 12 and the SD is exactly 2.3, and something bigger than zero otherwise. The idea is that the function tells me how close I am to getting the mean and SD right from my input shape and scale. For example:
f(c(3,1))
[1] 81.32257

This is not close to zero, so this shape and scale don’t get very close to the right mean and SD (you can calculate that the mean is 3 and the variance is also 3, so the SD is about 1.7). These are not close to 12 and 2.3.

All right, how can we use that to find the shape and scale that will give us the right mean and SD? If you are familiar with Excel or the like, you might know about Solver and Goal Seek, which let you find the input values that minimize the value in some other cell (Solver) or hit target values in other cells (Goal Seek). R has a function optim that will find the input values that minimize a function. This is why I wrote f as I did: the smallest value it can take is zero, when the input shape and scale produce a mean of exactly 12 and an SD of exactly 2.3. Otherwise, the function will produce a value that is positive (mean minus 12 squared will be positive and/or sd minus 2.3 squared will be positive, because the square of a non-zero thing is always positive, whether the thing itself is positive or negative). So if we minimize f, the shape and scale at which it is minimized will be the ones that give us the right mean and SD.

optim has a lot of complication, but we are using only the most basic options here. All we need to supply is an initial guess at what the best shape and scale might be (which can be really bad), and the function. My really bad values of shape and scale are the ones we used just above, which are not at all close to the answer. I save the output and look at it, because it contains a lot of things:

ans <- optim(c(3,1), f)
ans
$par
[1] 27.2309602  0.4406884

$value
[1] 2.528309e-07

$counts
function gradient 
     141       NA 

$convergence
[1] 0

$message
NULL

The values in par are the shape and scale (in that order) that minimize the function (that is, produce mean and SD closest to 12 and 2.3 respectively). These are the 27.23 and 0.44 that I had you use (those values are rounded off slightly). value is the minimized value of the function, which as you see is very close to zero, as we were expecting.5 counts tells us that optim evaluated the function 141 times before deciding it had found the answer. This may seem like a lot, but bear in mind that there are a lot of combinations of shape and scale that might be worth looking at. gradient is the first derivative, for veterans of calculus. You can also supply optim a function that calculates the first derivative of f for input shape and scale. This will usually help optim find the answer faster, but I was too lazy to work it out here. If convergence is zero, optim is confident that it found the answer (good news); a non-zero value is coupled with a message that tells you what went wrong. Here, though, all is good.

Let’s use our optimal scale and shape to work out how close we were to the target mean and SD:

v <- ans$par
v
[1] 27.2309602  0.4406884
v[1]*v[2]
[1] 12.00037
sqrt(v[1]*v[2]^2)
[1] 2.299657

We’re not going to get much closer than that!6

\(\blacksquare\)

  1. Make a histogram of your random sample of gamma-distributed values, and comment briefly on its shape.

Solution

Start with your dataframe:

ggplot(d, aes(x = g)) + geom_histogram(bins = 10)

This is very slightly skewed to the right, but it is not that far away from being normal.

A normal quantile plot should tell the same story:

ggplot(d, aes(sample = g)) + stat_qq() + stat_qq_line()

This is not far from normal (points all fairly close to the line), but the lowest values don’t go down far enough, and the highest ones are too high: that is, the distribution is somewhat right-skewed. (Your random gamma values will most likely be different from mine, so your histogram, and normal quantile plot if you drew one, will also most likely be slightly different from mine, but they should have similar shapes to mine.)

I am expecting most people to have a slightly right-skewed histogram. Gamma distributions in general are right-skewed because of the lower limit of zero, but you can see that this one doesn’t get very close to zero, so the skewness is rather mild.

\(\blacksquare\)

  1. Suppose now that you want to assume that the data have a gamma distribution with this scale and shape (and thus the same mean and SD that you used previously). Modify your simulation to estimate the power of the \(t\)-test against a null mean of 11 against the alternative that the mean is greater than 11, using a sample size of 20, with the data coming from this gamma distribution.

Solution

power.t.test assumes normally-distributed data, which we no longer want to assume, and so we cannot use that. (We are still doing a \(t\)-test, so that part is not a problem.)

Compared to the first power simulation for these data (the one that sampled from a normal distribution), there is actually only one line that needs to change: the third one, which generates the random samples from the truth. Instead of rnorm, this should have rgamma, using the shape and scale values from above, but with a sample size of 20. The test is the same, with the same hypotheses, so nothing else changes:

tibble(sim = 1:1000) %>% 
  rowwise() %>% 
  mutate(my_sample = list(rgamma(20, shape = 27.23, scale = 0.44))) %>% 
  mutate(t_test = list(t.test(my_sample, mu = 11, alternative = "greater"))) %>% 
  mutate(p_val = t_test$p.value) %>% 
  count(p_val <= 0.05)

Copy, paste, and edit from what you did before. Here, change the rnorm to rgamma and put the right numbers in.

My estimated power is 0.586.

\(\blacksquare\)

  1. Compare the estimated power from earlier with the estimated power from the previous part. Does the similarity or difference make sense? Explain briefly.

Solution

My simulated power values were 0.567 from the normal distribution (or 0.591 by calculation) and 0.586 from the gamma distribution. These are similar, and indicate that the power of the test does not depend much on the exact distribution that generates the data.

Does this make sense? Well, we saw from the histogram that this gamma distribution is actually rather close to looking normal in shape, and so the \(t\)-test should give about the same results whether we assume a normal or a gamma distribution. Thus the power should be about the same, as it is. A second, important, consideration is that with a sample size of 20, a slightly non-normal distribution, as we have, should have next to no impact on the \(t\)-test, and thus also on the power. The results we saw all make sense.

Extra 1: if you want to argue that the power is clearly smaller using the gamma distribution, you can go ahead. You can argue then that the results should be the same (for the reasons above) and they are not, which is surprising. Or you can try to argue that the power should be smaller when the times spent with children have a gamma distribution. One way to do that is to say that the gamma distribution is right-skewed, and therefore you will sometimes draw a far-out7 value from the right tail, which will make the sample SD bigger and therefore the \(t\)-statistic smaller and its P-value bigger. (Values out in the tail will have more of an effect on the SD than they do on the mean.) This means that the \(t\)-test will reject less often than it should, and therefore the power will be smaller. This rather complicated argument will support your case that the difference in power is not surprising.

Extra 2: I’m happy with you calling this a difference or not, and surprising or not, as long as you can make the case. I was curious about the “right” answer. My suspicion is that with gamma-distributed data that is already fairly close to normal and a sample size of 20, the sampling distribution of the sample mean should be pretty close to normal and therefore the power should be the same. There are a couple of ways to assess this.

One is to rerun the power simulation with 10,000 random samples from the gamma distribution, and get a confidence interval for the true power:

tibble(sim = 1:10000) %>% 
  rowwise() %>% 
  mutate(my_sample = list(rgamma(20, shape = 27.23, scale = 0.44))) %>% 
  mutate(t_test = list(t.test(my_sample, mu = 11, alternative = "greater"))) %>% 
  mutate(p_val = t_test$p.value) %>% 
  count(p_val <= 0.05)

This is literally a copy of what you just did, but changing 1000 to 10000 on the first line. And then:

prop.test(5760, 10000)

    1-sample proportions test with continuity correction

data:  5760 out of 10000, null probability 0.5
X-squared = 230.74, df = 1, p-value < 2.2e-16
alternative hypothesis: true p is not equal to 0.5
95 percent confidence interval:
 0.5662365 0.5857048
sample estimates:
    p 
0.576 

56.6% to 58.6%. The 59.1% we got from power.t.test before is actually not in this CI, so it does look as if the power is slightly less when the data have a gamma distribution compared to when the data distribution is normal.

A second way is to see what the sampling distribution of the sample mean looks like for samples from this gamma distribution: does it look normal?

tibble(sim = 1:10000) %>% 
  rowwise() %>% 
  mutate(my_sample = list(rgamma(20, shape = 27.23, scale = 0.44))) %>% 
  mutate(my_mean = mean(my_sample)) -> means
ggplot(means, aes(sample = my_mean)) + stat_qq() + stat_qq_line()  

It really does. So this says that we should expect the power not to change.

The first three lines are the same as above. What I did after that was to work out the mean of each simulated sample, and draw a normal quantile plot of those means. This is like what we did with the bootstrap, but here we “know” what the true data distribution is (or, we are playing what-if), so we sample from that rather than re-sampling from the data.

This is one of those problems where there is not a “right” answer, and your job as an applied statistician is to figure out which of these arguments you are most convinced by and make the case for that. Or, if you prefer, present both arguments and say that it could go either way. In this one, though, if the power is different, it is not different by much, and thus if you were to recommend a sample size, the one you got before would not be far wrong. (If you wanted to allow for the power from gamma-distributed data being less, you would make the sample size slightly bigger than before.)

\(\blacksquare\)

Footnotes

  1. I wanted to give you one that would produce a suitable power for a sample size of 100.↩︎

  2. You might suppose that if the mean is bigger, and this is a random variable that must be positive, the SD might also be bigger than in the previous research, but unless you have data from a “pilot study”, that is, a small version of the study you are going to do, you have no way to supply a suitable bigger value, so the best you can do is to stick with the value you have.↩︎

  3. The one I’m thinking of from lecture didn’t have an alternative because the default is two-sided, which is what it was. The two-sample example from lecture was one-sided, because the new method for teaching reading was supposed to be better, so that is the one to borrow from.↩︎

  4. There is nothing magic about the values 0.2, 0.5, and 0.8. People seem to have latched onto them as indicators of effect size, in the same way that people have latched on to \(\alpha = 0.05\). All you can really say about effect size is that the larger it is, the smaller the sample size you need to prove that it is real: that is, the smaller the sample size you need to obtain the power of your choice.↩︎

  5. There should be some scale and shape that get the mean and SD exactly right.↩︎

  6. There is no randomness here, so in principle you can get as close as you like, but there is a limit to the accuracy of the “floating-point” arithmetic that computers use for numbers with decimals.↩︎

  7. Not really far-out on this one, but that is the argument.↩︎