Worksheet 5

Published

October 6, 2024

Questions are below. My solutions will be available after the tutorials are all finished. The whole point of these worksheets is for you to use your lecture notes to figure out what to do. In tutorial, the TAs are available to guide you if you get stuck. Once you have figured out how to do this worksheet, you will be prepared to tackle the assignment that depends on it.

If you are not able to finish in an hour, I encourage you to continue later with what you were unable to finish in tutorial.

Child psychology

According to research in child psychology, working mothers spend a mean time of 11 minutes per day talking to their children, with a standard deviation of 2.3 minutes. Your research suggests that the mean should be greater than that, and you are planning a study of working mothers (who work outside the home) to see how many minutes they spend talking to their children. You think that the mean should be 12 minutes per day, and you want to design your study so that a mean of 11 should be rejected with a reasonably high probability.

  (a) If you interview 20 working mothers, what is the power of your test if your thought about the mean is correct? Estimate by simulation. Assume that the time that a mother spends talking to her children has a normal distribution.

For me, to kick off:

set.seed(457299)

The advantage to doing this is that every time you re-run code in your notebook (starting from the beginning and going all the way through), the set of random numbers in it will be the same, so that if you have talked about a set of random numbers that came out a certain way, your description won’t need to change if you run things again. The number inside the set.seed tells the random number generator where to start, and can be anything; some people use 1 or 123. Mine is an old phone number.

Now to work. The true mean (as far as you are concerned) is 12, so sample from a normal distribution with that mean (and the same standard deviation as the other research, since you don’t have a better value).1 The null mean is the one from the previous research, namely 11 (the one you want to reject), and because you want to prove that the mean is in fact greater than 11, you need an upper-tailed alternative. That leads to this:

library(tidyverse)

Then2 follow the familiar six lines of code:

tibble(sim = 1:1000) %>% 
  rowwise() %>% 
  mutate(my_sample = list(rnorm(20, 12, 2.3))) %>% 
  mutate(t_test = list(t.test(my_sample, mu = 11, alternative = "greater"))) %>% 
  mutate(p_value = t_test$p.value) %>% 
  count(p_value <= 0.05)

The estimated power is \(606/1000 = 0.606\). By interviewing 20 working mothers, you are likely to be able to reject the null hypothesis that the mean time spent talking to their children is 11 minutes per day, in favour of an alternative that the mean is larger than this.

Your answer will likely differ from mine by a bit (see below for how big “a bit” might be), but it should be somewhere near this.

The code:

  • set up a data frame with places for 1000 (or 10,000 or whatever) simulated samples
  • work one row at a time from here forward
  • draw a random sample of size 20 from the truth (as far as you know what it is) for each row
  • run a \(t\)-test testing whether the mean is 11 for each row (which you know, or at least think, it isn’t, but the onus is on your research to prove itself over the current state of affairs)
  • get the P-value from each test
  • count how many of those P-values are less than or equal to 0.05. The number of times this is true, divided by your number of simulations, is your estimate of the power. (A small variation on this last step is sketched just below.)
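By way of that variation: instead of count, you can compute the proportion of P-values at or below 0.05 directly, since taking the mean of a logical vector gives the proportion of TRUE values. This is only a sketch of the same simulation as above, so the number that comes out should be somewhere near the one above:

tibble(sim = 1:1000) %>% 
  rowwise() %>% 
  mutate(my_sample = list(rnorm(20, 12, 2.3))) %>% 
  mutate(t_test = list(t.test(my_sample, mu = 11, alternative = "greater"))) %>% 
  mutate(p_value = t_test$p.value) %>% 
  ungroup() %>% 
  summarize(power = mean(p_value <= 0.05))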

Extra: you can get a sense of how far off the simulation might have been by counting rejection as a “success” and non-rejection as a “failure”; my simulation gave 606 successes in 1000 trials, and a 95% confidence interval for the probability of success comes from a formula you might have run into, based on the normal approximation to the binomial, or this:

prop.test(606, 1000)

    1-sample proportions test with continuity correction

data:  606 out of 1000, null probability 0.5
X-squared = 44.521, df = 1, p-value = 2.516e-11
alternative hypothesis: true p is not equal to 0.5
95 percent confidence interval:
 0.5748595 0.6363158
sample estimates:
    p 
0.606 

A 95% CI for the true power of the test is from 0.575 to 0.636, and for me, the right answer (from power.t.test) is definitely inside that.

With 1000 simulations, you’ll see that the confidence interval goes about 3 percentage points up and down from the estimate. A rule of thumb is that confidence intervals for proportions based on samples of size 1000 will have a margin of error of about 3 percentage points (as long as the true proportion is somewhere near 0.5). Check this on the next opinion poll result you see: usually it will say something like “accurate to within 3 percentage points 19 times out of 20”, which is their calculated margin of error for a 95% CI. We know that the right answer in this case (from power.t.test) is 0.59, so if your estimated-by-simulation power comes out between about 0.56 and about 0.62, that is as expected. If it doesn’t, either you were unlucky or you have a coding error (such as, forgetting to do the test one-sided).
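Here is what that normal-approximation formula looks like in code, as a sketch, using my 606 rejections out of 1000; the last line is the worst-case margin of error that the 3-percentage-point rule of thumb comes from:

p_hat <- 606 / 1000
p_hat + c(-1, 1) * 1.96 * sqrt(p_hat * (1 - p_hat) / 1000)  # normal-approximation 95% CI, close to prop.test's
1.96 * sqrt(0.5 * 0.5 / 1000)                               # about 0.031: the "3 percentage points"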

If you did 10,000 simulated \(t\)-tests instead of my 1000, you will have a shorter CI that should likewise contain 0.591, for example:

tibble(sim = 1:10000) %>% 
  rowwise() %>% 
  mutate(my_sample = list(rnorm(20, 12, 2.3))) %>% 
  mutate(t_test = list(t.test(my_sample, mu = 11, alternative = "greater"))) %>% 
  mutate(p_value = t_test$p.value) %>% 
  count(p_value <= 0.05)

and

prop.test(5858, 10000)

    1-sample proportions test with continuity correction

data:  5858 out of 10000, null probability 0.5
X-squared = 294.12, df = 1, p-value < 2.2e-16
alternative hypothesis: true p is not equal to 0.5
95 percent confidence interval:
 0.5760642 0.5954695
sample estimates:
     p 
0.5858 

This goes to show that the extra time you spend waiting for 10,000 simulations pays off in the accuracy of your estimated power. We can now be pretty sure that the power is not as big as 0.6.

  (b) Explain briefly why power.t.test can be used to calculate an answer to this problem, and use it to check your result in the previous part.

We can use power.t.test for two reasons: (i) we actually are doing a \(t\)-test (for a population mean), and (ii) we are assuming that the times spent talking to children have a normal distribution.

For power.t.test we need to specify the sample size n, the difference between the true and null means delta, the population SD sd, the type of test (a one-sample \(t\)), and whether it is one- or two-sided:

power.t.test(n = 20, delta = 12 - 11, sd = 2.3, type = "one.sample", alternative = "one.sided")

     One-sample t test power calculation 

              n = 20
          delta = 1
             sd = 2.3
      sig.level = 0.05
          power = 0.5908187
    alternative = one.sided

If you are like me, you probably typed alternative = "greater", as you would with t.test, but then the error message will tell you what you actually need, either one.sided or two.sided. You need to be aware that information about the one-sidedness of your test has to find its way into power.t.test somehow; it turns out that it doesn’t matter which kind of one-sided test you have (greater or less), just that you have a one-sided test at all.3

The power here is actually 0.590. My simulated value of 0.606 was close. Your simulated value in the previous part will probably be different from mine, but it ought to be somewhere near to 0.590. (See the Extra in the previous part for how close you would expect to be.)
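If you want just the number (for example, to place it next to your simulated estimate), you can save the output of power.t.test and pull out its power component. A sketch:

res <- power.t.test(n = 20, delta = 12 - 11, sd = 2.3, type = "one.sample", alternative = "one.sided")
res$power  # just the power, about 0.59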

\(\blacksquare\)

  (c) A standard level of power in studies in child psychology is 80% (0.80). According to what you have seen so far, is it necessary to interview 20 working mothers, or more, or fewer? Explain briefly. Use power.t.test to obtain an appropriate sample size, under the assumptions you have made.

Solution

Our power is smaller than the target power of 0.8, so our sample size should be bigger than 20. (A bigger sample size will give a bigger power.)

Use power.t.test, inserting a value for power, and leaving out n (since that is what we want the function to find for us):

power.t.test(power = 0.80, delta = 12 - 11, sd = 2.3, type = "one.sample", alternative = "one.sided")

     One-sample t test power calculation 

              n = 34.10007
          delta = 1
             sd = 2.3
      sig.level = 0.05
          power = 0.8
    alternative = one.sided

Round the sample size up to 35. We need to interview 35 working mothers to get 80% power.
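If you would rather have R do the rounding, the n component of the power.t.test output holds the unrounded sample size, so something like this sketch works:

res <- power.t.test(power = 0.80, delta = 12 - 11, sd = 2.3, type = "one.sample", alternative = "one.sided")
ceiling(res$n)  # round up: rounding down would leave the power just short of 0.80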

Extra: Psychologists (and other people) like to quantify things in terms of “effect size”, which is a difference in means divided by a standard deviation (like a \(z\)-score but applied to means), which some people know of as “Cohen’s \(d\)”:

d <- (12 - 11)/ 2.3
d
[1] 0.4347826

A commonly quoted convention (going back to Cohen himself) is that an effect size of 0.2 is considered “small”, 0.5 “moderate” and 0.8 “large”, so what we have here is a “moderate” effect, one that needs a moderate sample size to provide convincing evidence that it is real.4
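A consequence of defining things this way, shown here only as a sketch: because the power depends on delta and sd only through their ratio, you can hand power.t.test the effect size itself as delta, with sd = 1, and get the same answer as before:

power.t.test(n = 20, delta = d, sd = 1, type = "one.sample", alternative = "one.sided")  # same power, about 0.59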

Two-sample power

Suppose we have two populations, which we suppose are both normally distributed. The first population has a mean of 20, and the second population has a mean of 25. Both populations have the same SD of 9.

  (a) Suppose we take samples of 30 observations from each population. Use power.t.test to find the probability that a two-sample \(t\)-test will (correctly) reject the null hypothesis that the two populations have the same mean, in favour of a one-sided alternative. (Hint: delta should be positive.)

Solution

Note first that power.t.test works here because (i) the populations have normal distributions, (ii) the SDs are the same, (iii) the sample sizes from each population are the same.

Then, subtracting the means so that the bigger one is first (and thus delta is positive):

power.t.test(n = 30, delta = 25-20, sd = 9, type = "two.sample", alternative = "one.sided")

     Two-sample t test power calculation 

              n = 30
          delta = 5
             sd = 9
      sig.level = 0.05
          power = 0.6849538
    alternative = one.sided

NOTE: n is number in *each* group

The power is about 0.68. (Give one more decimal if you prefer.)

You’ll recall that we don’t get to say which one-sided alternative we want to reject in favour of. The way power.t.test does it is to assume the first population has the bigger mean, which is why I said to use a positive delta. (Otherwise you’ll get an extremely small power, which corresponds to concluding that the population with the smaller mean actually has the bigger one, which you could imagine is not going to happen much.)
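If you want to see that for yourself, a quick sketch: the same calculation with the means subtracted the other way around, so that delta is negative, should report a power very close to zero:

power.t.test(n = 30, delta = 20 - 25, sd = 9, type = "two.sample", alternative = "one.sided")  # delta negative: almost no chance of rejecting in this direction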

\(\blacksquare\)

  (b) Find the sample size needed to obtain a power of 0.75. Comment briefly on whether your sample size makes sense.

Solution

The coding part is not meant to be hard: take what you just did, and replace the n with the power you are shooting for:

power.t.test(power = 0.75, delta = 25-20, sd = 9, type = "two.sample", alternative = "one.sided")

     Two-sample t test power calculation 

              n = 35.55396
          delta = 5
             sd = 9
      sig.level = 0.05
          power = 0.75
    alternative = one.sided

NOTE: n is number in *each* group

The required sample size is 36 in each group, which, were you doing this on an assignment, you would need to say. (Remember also to round up to the next whole number, since you want at least that much power.)

As to why the answer makes sense, well, before we had a sample size of 30 in each group and a power about 0.68. Our target power here was slightly bigger than that, and to increase the power (all else fixed) we need to increase the sample size as well. The sample size needed is slightly bigger than 30, which matches with the power desired being slightly bigger than 0.68.
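A sanity check you could run, as a sketch: put the rounded-up sample size back into power.t.test; the power that comes out should be a little above the 0.75 target, because 36 is a little more than the 35.55 that gives a power of exactly 0.75:

power.t.test(n = 36, delta = 25 - 20, sd = 9, type = "two.sample", alternative = "one.sided")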

\(\blacksquare\)

  (c) Reproduce your power result from (a) by simulation. Some things to consider:
  • you will need to generate two columns of random samples, one from each population
  • t.test can also run a two-sample \(t\)-test by giving the two columns separately, rather than as we have done it before by having a column with all the measurements and a separate column saying which group they came from.
  • you will need to get the right alternative. With two columns input like this, the alternative is relative to the column you give first.

Solution

With all that in mind, something like this:

tibble(sim = 1:1000) %>% 
  rowwise() %>% 
  mutate(sample1 = list(rnorm(30, 20, 9)),
         sample2 = list(rnorm(30, 25, 9))) %>% 
  mutate(my_test = list(t.test(sample1, sample2, alternative = "less"))) %>% 
  mutate(p_val = my_test$p.value) %>% 
  count(p_val <= 0.05)

My estimated power is 0.675, entirely consistent with part (a).

Code notes:

  • generate two columns of random samples, one from each population
  • feed the two columns of samples into t.test. The first population actually had the smaller mean, so that is the one side that you would like the test to reject in favour of.
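Going back to the estimated power of 0.675: as in the Extra to the very first question, you can attach a margin of error to it. A sketch, using my 675 rejections out of 1000 (substitute your own count); the interval should be roughly 0.675 plus or minus 0.03, which comfortably contains the 0.685 from power.t.test:

prop.test(675, 1000)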

\(\blacksquare\)

  (d) Give an example of a situation where the simulation approach could be used and power.t.test not.

Solution

In my solution to the first question I said

“Note first that power.t.test works here because (i) the populations have normal distributions, (ii) the SDs are the same, (iii) the sample sizes from each population are the same.”

These all need to be true in order to use power.t.test, so to answer this one, describe a situation where one of them fails, eg:

  • the population distributions are something other than normal (“child psychology revisited” is an example of this)
  • the two populations have different SDs (spreads). I think power.t.test is using the pooled test behind the scenes.
  • the sample sizes are different (for example, sampling from one population might be more expensive than sampling from the other, and you might want to achieve a certain power with one sample size being twice the other).

It is also worth noting that you could make these changes in your simulation code without great difficulty (a sketch follows the list below):

  • change rnorm to something else
  • change the 9’s in rnorm to something else (the one in sample1 different from the one in sample2)
  • change the 30’s in rnorm to something else.
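For instance, here is a minimal sketch of the two-sample simulation with unequal sample sizes and unequal SDs. The values 40, 20 and 12 are made up purely for illustration; note that t.test runs the Welch procedure by default, which does not assume equal spreads:

tibble(sim = 1:1000) %>% 
  rowwise() %>% 
  mutate(sample1 = list(rnorm(40, 20, 9)),      # 40 values from population 1, SD 9
         sample2 = list(rnorm(20, 25, 12))) %>% # 20 values from population 2, SD 12
  mutate(my_test = list(t.test(sample1, sample2, alternative = "less"))) %>% 
  mutate(p_val = my_test$p.value) %>% 
  count(p_val <= 0.05)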

\(\blacksquare\)

Child psychology, revisited

This is a continuation of an earlier question about talking to children.

  (a) Another distribution that might be suitable for time spent talking to children is the gamma distribution. Values from a gamma distribution are guaranteed to be greater than zero (which is suitable for times spent talking to children). As far as R is concerned, a random value from a gamma distribution is generated using the function rgamma. This, for us, has three inputs: the number of values to generate, a parameter called shape for which we will use the value 27.23, and a parameter called scale for which we will use the value 0.44. Generate a random sample of 1000 values from a gamma distribution with the given parameter values. Hint: make sure that the inputs that need names actually have names, and organize your results as a column in a dataframe.

Solution

Following the hint, something like this:

d <- tibble(g = rgamma(1000, shape = 27.23, scale = 0.44))
d

Displaying the dataframe as usual shows the first 10 rows, which once again will enable anyone reading your work to see that you have the right kind of thing.

I am apparently using my standard “temporary dataframe name” of d. Give it and the column in it whatever names you think make sense. For example, you might use gamma for the column and gammas (plural) for the dataframe.

\(\blacksquare\)

  (b) Find the mean and SD of your random sample of values from the gamma distribution. Are the mean and SD somewhere close to the mean and SD you used in your first power analysis? Explain (very) briefly.

Solution

The first thing is the mean and SD of your random values. Starting from a dataframe, this is summarize:

d %>% summarize(g_mean = mean(g), g_sd = sd(g))

The mean and SD we used in the first simulation of power (in 1(a)) were 12 and 2.3 respectively, and these (for me) are very close to that. (Calculating the mean and SD and saying just “these are close” will confuse your reader, so make sure you say what they are supposed to be close to.)

Extra: I actually defined the shape and scale for your gamma distribution to be the ones that would give you the right mean and SD on average, and so the only reason they didn’t come out exactly the same is randomness, in your drawing of 1000 random values. Asking you to obtain 1000 random values (rather than fewer) should mean that your mean and SD come out close to the right thing with high probability. More values will also give you a better picture (next part).

If you look in the help for rgamma, by typing ?rgamma in the console (the help comes out bottom right), you’ll see (hiding in the Details) that the mean of a gamma distribution is shape times scale, and the variance (SD squared) is shape times scale-squared. This means that if you know shape and scale, you also know what the mean and SD of the distribution are. But our problem is the opposite way around: we know what mean and SD we want, and we want to find the shape and scale that produce them. This actually looks as if you can solve it using algebra, but I am too lazy to do algebra tonight, so I want to show you how I actually did it, which is better than the trial and error you would otherwise guess.
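As it happens, the algebra is quick if you do feel like it: since the mean is shape times scale and the variance is shape times scale-squared, dividing the variance by the mean gives the scale, \(\sigma^2/\mu\), and then the shape is the mean divided by the scale, which works out to \(\mu^2/\sigma^2\). A sketch with our target mean of 12 and SD of 2.3:

scale <- 2.3^2 / 12   # sigma-squared over mu
shape <- 12 / scale   # mu over scale, the same thing as mu-squared over sigma-squared
c(shape = shape, scale = scale)

These come out as about 27.22 and 0.441, essentially the 27.23 and 0.44 that optim finds below; the optim approach is still worth seeing, though, because it works even when the algebra is not so tidy.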

The starting point is a function, rather unimaginatively called f, that takes as input a scale and shape, gets hold of the mean and SD that go with that scale and shape, and then sees how close we are to the 12 and 2.3 we were aiming for. It looks like this:

f <- function(x) {
  shape <- x[1]
  scale <- x[2]
  mean <- shape * scale
  variance <- shape * scale^2
  sd <- sqrt(variance)
  (mean - 12)^2 + (sd - 2.3)^2
}

The input is actually a vector called x which contains the shape as its first thing and the scale as the second one. If you have not seen an R function up close before, this is what one looks like. On the top line is the name of the function (f), the word function, and its input (here just x which actually contains two values, the first one being the shape and the second one the scale.) Then, inside curly brackets, the actual business of the function. If you have written a function in Python before, you might remember that they start with def and use indentation rather than curly brackets. They also use return to mark the end result of the calculation that is sent back to the outside world. R is a bit different; the last line of a function is typically a calculation, and the result of that calculation is what is sent back to the outside world.

To take you through my function:

  • first pull the scale and shape out of the input x (so that I remember which is which)
  • work out the mean corresponding to the input shape and scale
  • work out the variance ditto
  • work out the SD from the variance (by taking its square root)
  • calculate something that will be zero if the mean is exactly 12 and the SD is exactly 2.3, and something bigger than zero otherwise. The idea is that the function tells me how close I am to getting the mean and SD right from my input shape and scale. For example:
f(c(3,1))
[1] 81.32257

This is not close to zero, so this shape and scale don’t get very close to the right mean and SD (you can calculate that the mean is 3 and the variance is also 3, so the SD is about 1.7). These are not close to 12 and 2.3.

All right, how can we use that to find the shape and scale that will give us the right mean and SD? If you are familiar with Excel or the like, you might know about Solver and Goal Seek, which let you find the input values that minimize the value in some other cell (Solver) or hit a target value in another cell (Goal Seek). R has a function optim that will find the input values that minimize a function. This is why I wrote f as I did: the smallest value it can take is zero, when the input shape and scale produce a mean of exactly 12 and an SD of exactly 2.3. Otherwise, the function will produce a value that is positive (mean minus 12, squared, will be positive and/or SD minus 2.3, squared, will be positive, because the square of a non-zero quantity is always positive, whether the quantity itself is positive or negative). So if we minimize f, the shape and scale at which it is minimized will be the ones that give us the right mean and SD.

optim has a lot of complication, but we are using only the most basic options here. All we need to supply is an initial guess at what the best shape and scale might be (which can be really bad), and the function. My really bad values of shape and scale are the ones we used just above, which are not at all close to the answer. I save the output and look at it, because it contains a lot of things:

ans <- optim(c(3,1), f)
ans
$par
[1] 27.2309602  0.4406884

$value
[1] 2.528309e-07

$counts
function gradient 
     141       NA 

$convergence
[1] 0

$message
NULL

The values in par are the shape and scale (in that order, matching the input x) that minimize the function, that is, that produce a mean and SD closest to 12 and 2.3 respectively. These are the 27.23 and 0.44 that I had you use (those values rounded off slightly). value is the minimized value of the function, which as you see is very close to zero, as we were expecting.5 counts tells us that optim evaluated the function 141 times before deciding it had found the answer. This may seem like a lot, but bear in mind that there are a lot of combinations of shape and scale that might be worth looking at. gradient is the first derivative, for veterans of calculus. You can also supply optim with a function that calculates the first derivative of f for an input shape and scale; this will usually help optim find the answer faster, but I was too lazy to work it out here. If convergence is zero, optim is confident that it found the answer (good news); a non-zero value comes with a message that tells you what went wrong. Here, though, all is good.

Let’s use our optimal scale and shape to work out how close we were to the target mean and SD:

v <- ans$par
v
[1] 27.2309602  0.4406884
v[1]*v[2]
[1] 12.00037
sqrt(v[1]*v[2]^2)
[1] 2.299657

We’re not going to get much closer than that!6

\(\blacksquare\)

  (c) Make a histogram of your random sample of gamma-distributed values, and comment briefly on its shape.

Solution

Start with your dataframe:

ggplot(d, aes(x = g)) + geom_histogram(bins = 10)

This is very slightly skewed to the right, but it is not that far away from being normal.

Extra: if you have learned about the normal quantile plot by the time you read this:

ggplot(d, aes(sample = g)) + stat_qq() + stat_qq_line()

This is not far from normal (points all fairly close to the line), but the lowest values don’t go down far enough, and the highest ones are too high: that is, the distribution is somewhat right-skewed. (Your random gamma values will most likely be different from mine, so your histogram, and normal quantile plot if you drew one, will also most likely be slightly different from mine, but they should have similar shapes to mine.)

I am expecting most people to have a slightly right-skewed histogram. Gamma distributions in general are right-skewed because of the lower limit of zero, but you can see that this one doesn’t get very close to zero, so the skewness is rather mild.

\(\blacksquare\)

  (d) Suppose now that you want to assume that the data have a gamma distribution with this scale and shape (and thus the same mean and SD that you used previously). Modify your simulation to estimate the power of the \(t\)-test with a null mean of 11 against the alternative that the mean is greater than 11, using a sample size of 20, with the data coming from this gamma distribution.

Solution

power.t.test assumes normally-distributed data, which we no longer want to assume, and so we cannot use that. (We are still doing a \(t\)-test, so that part is not a problem.)

Compared to the first simulation (the one that assumed a normal distribution), there is actually only one line that needs to change, the third one, which generates the random samples from the truth. Instead of rnorm, this should have rgamma, using the shape and scale values from above, but with a sample size of 20. The test is the same, with the same hypotheses, so nothing else changes:

tibble(sim = 1:1000) %>% 
  rowwise() %>% 
  mutate(my_sample = list(rgamma(20, shape = 27.23, scale = 0.44))) %>% 
  mutate(t_test = list(t.test(my_sample, mu = 11, alternative = "greater"))) %>% 
  mutate(p_val = t_test$p.value) %>% 
  count(p_val <= 0.05)

Copy, paste, and edit from what you did before. Here, change the rnorm to rgamma and put the right numbers in.

My estimated power is 0.586.

\(\blacksquare\)

  (e) Compare the estimated power from earlier with the one from the previous part. Does the similarity or difference make sense? Explain briefly.

Solution

My simulated power values were 0.606 from the normal distribution (or 0.591 by calculation) and 0.586 from the gamma distribution. These are similar, and indicate that the power of the test does not depend much on the exact distribution that generates the data.

Does this make sense? Well, we saw from the histogram that this gamma distribution is actually rather close to looking normal in shape, and so the \(t\)-test should give about the same results whether we assume a normal or a gamma distribution. Thus the power should be about the same, as it is. A second, important, consideration is that with a sample size of 20, a slightly non-normal distribution, as we have, should have next to no impact on the \(t\)-test, and thus also on the power. The results we saw all make sense.

Extra 1: if you want to argue that the power is clearly smaller using the gamma distribution, you can go ahead. You can argue then that the results should be the same (for the reasons above) and they are not, which is surprising. Or you can try to argue that the power should be smaller when the times spent with children have a gamma distribution. One way to do that is to say that the gamma distribution is right-skewed, and therefore you will sometimes draw a far-out7 value from the right tail, which will make the sample SD bigger and therefore the \(t\)-statistic and its P-value smaller. (Values out in the tail will have more of an effect on the SD than they do on the mean.) This means that the \(t\)-test will reject less often than it should, and therefore the power will be smaller. This rather complicated argument will support your case that the difference in power is not surprising.

Extra 2: I’m happy with you calling this a difference or not, and surprising or not, as long as you can make the case. I was curious about the “right” answer. My suspicion is that with gamma-distributed data that is already fairly close to normal and a sample size of 20, the sampling distribution of the sample mean should be pretty close to normal and therefore the power should be the same. There are a couple of ways to assess this.

One is to rerun the power simulation with 10,000 random samples from the gamma distribution, and get a confidence interval for the true power:

tibble(sim = 1:10000) %>% 
  rowwise() %>% 
  mutate(my_sample = list(rgamma(20, shape = 27.23, scale = 0.44))) %>% 
  mutate(t_test = list(t.test(my_sample, mu = 11, alternative = "greater"))) %>% 
  mutate(p_val = t_test$p.value) %>% 
  count(p_val <= 0.05)

This is literally a copy of what you just did, but changing 1000 to 10000 on the first line. And then:

prop.test(5760, 10000)

    1-sample proportions test with continuity correction

data:  5760 out of 10000, null probability 0.5
X-squared = 230.74, df = 1, p-value < 2.2e-16
alternative hypothesis: true p is not equal to 0.5
95 percent confidence interval:
 0.5662365 0.5857048
sample estimates:
    p 
0.576 

56.6% to 58.6%. The 59.1% we got from power.t.test before is actually not in this CI, so it does look as if the power is slightly less when the data have a gamma distribution compared to when the data distribution is normal.

A second way is to see what the sampling distribution of the sample mean looks like for samples from this gamma distribution: does it look normal?8

tibble(sim = 1:1000) %>% 
  rowwise() %>% 
  mutate(my_sample = list(rgamma(20, shape = 27.23, scale = 0.44))) %>% 
  mutate(my_mean = mean(my_sample)) -> means
ggplot(means, aes(x = my_mean)) + geom_histogram(bins = 10)

ggplot(means, aes(sample = my_mean)) + stat_qq() + stat_qq_line()  

It really does. So this says that we should expect the power not to change.

The first three lines are the same as above. What I did after that was to work out the mean of each simulated sample, and make a histogram of those. This is like what we did with the bootstrap, but here we “know” what the true data distribution is (or, we are playing what-if), so we sample from that rather than re-sampling from the data.

This is one of those problems where there is not a “right” answer, and your job as an applied statistician is to figure out which of these arguments you are most convinced by and make the case for that. Or, if you prefer, present both arguments and say that it could go either way. In this one, though, if the power is different, it is not different by much, and thus if you were to recommend a sample size, the one you got before would not be far wrong. (If you wanted to allow for the power from gamma-distributed data being less, you would make the sample size slightly bigger than before.)

\(\blacksquare\)

Footnotes

  1. You might suppose that if the mean is bigger, and this is a random variable that must be positive, that the SD might also be bigger than in the previous research, but unless you have data from a “pilot study”, that is, a small version of the study you are going to do, you have no way to supply a suitable bigger value, so the best you can do is to stick with the value you have.↩︎

  2. The set.seed is actually part of “base R”, so doesn’t need the tidyverse to run. If you are in the probably good habit of running library(tidyverse) before you do anything else, that way also works here.↩︎

  3. The one I’m thinking of from lecture didn’t have an alternative because the default is two-sided, which is what it was. The two-sample example from lecture was one-sided, because the new method for teaching reading was supposed to be better, so that is the one to borrow from.↩︎

  4. There is nothing magic about the values 0.2, 0.5, and 0.8. People seem to have latched onto them as indicators of effect size, in the same way that people have latched on to \(\alpha = 0.05\). All you can really say about effect size is that the larger it is, the smaller the sample size you need to prove that it is real: that is, the smaller the sample size you need to obtain the power of your choice.↩︎

  5. There should be some scale and shape that get the mean and SD exactly right.↩︎

  6. There is no randomness here, so that in principle you can get as close as you like, but there is a limit to the accuracy of “floating-point”, that is, decimal, arithmetic on computers.↩︎

  7. Not really far-out on this one, but that is the argument.↩︎

  8. This idea comes from the “bootstrap” lecture.↩︎