Worksheet 6

Published

October 8, 2025

Questions are below. My solutions will be available after the tutorials are all finished. The whole point of these worksheets is for you to use your lecture notes to figure out what to do. In tutorial, the TAs are available to guide you if you get stuck. Once you have figured out how to do this worksheet, you will be prepared to tackle the assignment that depends on it.

If you are not able to finish in an hour, I encourage you to continue later with what you were unable to finish in tutorial.

Packages

library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.1     ✔ stringr   1.5.2
✔ ggplot2   4.0.0     ✔ tibble    3.3.0
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.1.0     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(smmr)

Two-sample power

Suppose we have two populations, both of which we suppose to be normally distributed. The first population has a mean of 20, and the second population has a mean of 25. Both populations have the same SD of 9.

  1. Suppose we take samples of 30 observations from each population. Use power.t.test to find the probability that a two-sample \(t\)-test will (correctly) reject the null hypothesis that the two populations have the same mean, in favour of a one-sided alternative. (Hint: delta should be positive.)

Solution

Note first that power.t.test works here because (i) the populations have normal distributions, (ii) the SDs are the same, (iii) the sample sizes from each population are the same.

Then, subtracting the means so that the bigger one is first (and thus delta is positive):

power.t.test(n = 30, delta = 25-20, sd = 9, type = "two.sample", alternative = "one.sided")

     Two-sample t test power calculation 

              n = 30
          delta = 5
             sd = 9
      sig.level = 0.05
          power = 0.6849538
    alternative = one.sided

NOTE: n is number in *each* group

The power is about 0.68. (Give one more decimal place if you prefer.)

You’ll recall that we don’t say which side we want the test to reject in favour of. The way power.t.test handles this is to assume that the first population has the bigger mean, which is why I said to use a positive delta. (Otherwise you’ll get an extremely small power, corresponding to concluding that the population with the smaller mean actually has the bigger one, which, as you can imagine, is not going to happen much.)
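To convince yourself of this, you could run the same calculation with the means subtracted the other way around, so that delta is negative; with the one-sided alternative, the power should then come out very close to zero. (This is just an optional check.)

power.t.test(n = 30, delta = 20-25, sd = 9, type = "two.sample", alternative = "one.sided") # delta negative: expect a tiny power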

\(\blacksquare\)

  2. Find the sample size needed to obtain a power of 0.75. Comment briefly on whether your sample size makes sense.

Solution

The coding part is not meant to be hard: take what you just did, and replace the n with the power you are shooting for:

power.t.test(power = 0.75, delta = 25-20, sd = 9, type = "two.sample", alternative = "one.sided")

     Two-sample t test power calculation 

              n = 35.55396
          delta = 5
             sd = 9
      sig.level = 0.05
          power = 0.75
    alternative = one.sided

NOTE: n is number in *each* group

The required sample size is 36 in each group, which, were you doing this on an assignment, you would need to say. (Remember also to round up to the next whole number, since you want at least that much power.)
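If you would rather have R do the rounding up for you, the result of power.t.test is a list with a component n that you can extract (an optional extra, not something you need to do):

ans <- power.t.test(power = 0.75, delta = 25-20, sd = 9, type = "two.sample", alternative = "one.sided")
ceiling(ans$n)   # round up to the next whole number: 36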

As to why the answer makes sense, well, before we had a sample size of 30 in each group and a power about 0.68. Our target power here was slightly bigger than that, and to increase the power (all else fixed) we need to increase the sample size as well. The sample size needed is slightly bigger than 30, which matches with the power desired being slightly bigger than 0.68.

\(\blacksquare\)

  3. Reproduce your power result from Question 1 by simulation. Some things to consider:
  • you will need to generate two columns of random samples, one from each population
  • t.test can also run a two-sample \(t\)-test by giving the two columns separately, rather than as we have done it before by having a column with all the measurements and a separate column saying which group they came from.
  • you will need to get the right alternative. With two columns input like this, the alternative is relative to the column you give first.

Solution

With all that in mind, something like this:

tibble(sim = 1:1000) %>% 
  rowwise() %>% 
  mutate(sample1 = list(rnorm(30, 20, 9)),
         sample2 = list(rnorm(30, 25, 9))) %>% 
  mutate(my_test = list(t.test(sample1, sample2, alternative = "less"))) %>% 
  mutate(p_val = my_test$p.value) %>% 
  count(p_val <= 0.05)

My estimated power is 0.675, entirely consistent with Question 1. The rule of thumb says that with 1000 simulations, our estimated power here should be within 0.03 of the correct answer, 95% of the time. If your answer here is not between about 0.650 and 0.720, that would be a good reason to look at your code carefully. Did you miss something, like the alternative?

Code notes:

  • generate two columns of random samples, one from each population
  • feed the two columns of samples into t.test. The first population actually had the smaller mean, so “less” is the one-sided alternative that you would like the test to reject in favour of.
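As an aside, the 0.03 in the rule of thumb above is roughly twice the standard error of a proportion estimated from 1000 simulations, which you can check for yourself (taking 0.68 as a stand-in for the power being estimated):

p_hat <- 0.68                          # roughly the power we are estimating
2 * sqrt(p_hat * (1 - p_hat) / 1000)   # margin of error: about 0.03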

\(\blacksquare\)

  4. Give an example of a situation where the simulation approach could be used but power.t.test could not.

Solution

In my solution to the first question I said

“Note first that power.t.test works here because (i) the populations have normal distributions, (ii) the SDs are the same, (iii) the sample sizes from each population are the same.”

These all need to be true in order to use power.t.test, so to answer this one, describe a situation where one of them fails, e.g.:

  • the population distributions are something other than normal (“child psychology revisited” is an example of this)
  • the two populations have different SDs (spreads). I think power.t.test is using the pooled test behind the scenes.
  • the sample sizes are different (for example, sampling from one population might be more expensive than sampling from the other, and you might want to achieve a certain power with one sample size being twice the other).

It is also worth noting that you could make these changes in your simulation code without great difficulty (see the sketch after this list):

  • change rnorm to something else
  • change the 9’s in rnorm to something else (the one in sample1 different from the one in sample2)
  • change the 30’s in rnorm to something else.
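For example, here is a sketch of the simulation with two of those changes made at once, using invented numbers: the second sample is twice the size of the first and comes from a population with a larger SD. The Welch test that t.test runs by default is happy with unequal spreads; swapping rnorm for something like rexp would handle a non-normal population in the same way:

tibble(sim = 1:1000) %>% 
  rowwise() %>% 
  mutate(sample1 = list(rnorm(30, 20, 9)),
         sample2 = list(rnorm(60, 25, 12))) %>%    # twice the sample size, larger SD (invented values)
  mutate(my_test = list(t.test(sample1, sample2, alternative = "less"))) %>% 
  mutate(p_val = my_test$p.value) %>% 
  count(p_val <= 0.05)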

\(\blacksquare\)

The thickness of stamps

Collectors of postage stamps know that the same stamp may be made from several different batches of paper of different thicknesses. Our data set, in http://ritsokiguess.site/datafiles/stamp.csv, contains the thickness, in millimetres, of each of 485 stamps that were printed in 1872. It is suspected that the paper used in that year was thinner than in previous years.

  1. Read in and display (some of) the data.

Solution

Very much the usual:

my_url <- "http://ritsokiguess.site/datafiles/stamp.csv"
stamp <- read_csv(my_url)
Rows: 485 Columns: 1
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
dbl (1): Thickness

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
stamp

Note for yourself that there is one column called Thickness with an uppercase T, and that the thicknesses appear to be arranged in ascending order.

\(\blacksquare\)

  2. Make a suitable graph of these data. Justify your choice briefly.

Solution

The obvious justification and graph is that you have one quantitative variable, so that what you need is a histogram. This is absolutely fine:

ggplot(stamp, aes(x = Thickness)) + geom_histogram(bins = 10)

There are almost 500 observations, so you can certainly justify something like 8–10 bins. There is actually a bit of extra shape here that 10 bins helps you see (that maybe-extra-peak around 0.10 mm); the guiding principle is that if there are more details of shape that you want to convey, you’ll need more bins than you would if everything were nicely bell-shaped. I discuss that funky shape in an Extra.

Your other option here is a normal quantile plot. For this, you need to say why specifically normality is of interest to you here. You can do that by looking ahead and seeing that you will be doing a sign test, and at that point you realize the data are probably going to be non-normal, so that normality is a concern here. Or, more simply, you can say that the distribution of thicknesses being normal (or not) will tell you which test to do (a one-sample \(t\) or a sign test). This is why I asked you for a brief justification: you should be choosing a normal quantile plot because what matters to you is whether or not the data values have a normal distribution:

ggplot(stamp, aes(sample = Thickness)) + stat_qq() + stat_qq_line()

This, plus something that justifies a normal quantile plot over a histogram, would also be full marks (if there were marks attached to this).

Extra 1: you can sort of see the bimodality in the normal quantile plot as well, if you know where to look. The idea is that when the pattern of points is steeper than the line, there are fewer observations there than you would expect with a normal distribution; when the pattern is less steep than the line, there are more observations than expected with a normal. Thus, places where the points are suddenly less steep are places to look for an extra peak.

I’m going to come back to our data in a moment, but an easier-to-see example is here (scroll down near the bottom to the three-humped multimodal example in Figure 5.12). On their histogram, there is a “hole” just under 0.75 (no data between the second and third peaks), and this shows up on the normal quantile plot as a gap below 0.75 on the \(y\)-axis,1 or, said differently, the points jump upwards very fast between 0.625 and 0.75 on the \(y\)-axis. With that in mind, you can see another quick jump between about 0.26 and 0.35 (on the \(y\)-axis), which corresponds to the other “hole” between the first and second peaks.

To find where the peaks might be, you look for places where the points on the normal quantile plot are close to flat. I see one place around 0.125 (on the \(y\)-axis again), which is the first peak, another around 0.5 (in between the two “gaps”; the second peak), and a third approaching 1 (the third peak).

As an aside, in that example they also show a boxplot of the same data. You can see that the boxplots for a three-humped distribution and a uniform (flat-topped) distribution are almost identical. Hence, boxplots are no help in searching for multiple modes.

That example was (presumably) made-up data to illustrate what happens; ours are real data, where things are going to be more messy. In ours, I think it is easier to search for places where the points are steeper than the line (the gaps between peaks). There is a steep bit that extends from about 0.08 to about 0.10, with flatter sections at each end, which suggests peaks around 0.08 and 0.10 and a “hole” in between, relative to the normal at least. The points on the normal quantile plot have an odd shape, odder than the simple curve you would expect for right-skewness, so there might be more peaks as well, for example around 0.12 or around 0.11.

Extra 2: in my original source for these data, the interest was actually not in the mean or median thickness, but in something else: the stamps issued in that year were thought to be a “philatelic mixture”, that is, printed on more than one type of paper. You would expect different types of paper to have different thicknesses (especially in 1872, when manufacturing processes were a good deal less uniform than they are now).

If you go back to my histogram above, you see that the histogram is skewed to the right (which we take up shortly), but it also looks as if there is a second mode (peak) just above 0.10 mm. This would support the idea that these stamps were made from two different types of paper: one with average thickness around 0.08 mm, and one with average thickness just above 0.10 mm. (The relatively small number of stamps of thickness around 0.09 mm supports this: stamps of thickness 0.09 could be unusually thick ones from the 0.08 mm paper, or unusually thin ones from the 0.10 mm paper.) (This is why a one-sample boxplot is not on the table here; that plot effectively assumes unimodality (one peak). If you have a unimodal distribution, a boxplot tells you about the centre, spread, and shape, but if you have a multimodal distribution, those peaks just disappear on a boxplot and you never see them.)

Going back to the histogram, you might be wondering whether there are more modes that this histogram hides. This is one of the rare cases where looking at a lot more bins is helpful, in this case 30:

ggplot(stamp, aes(x = Thickness)) + geom_histogram(bins = 30)

Looking at that, you could make the case for as many as 7 modes, though how many of them are real and how many are just chance is another question. In Efron and Tibshirani’s book on the bootstrap (reference below), they use a bootstrap-based method to make the case for 7 modes (and thus 7 different types of paper). Our library has this book, or you may be able to find it in your favourite quasi-legal source. The example starts on page 227.

Reference:

Efron, B. and Tibshirani, R. (1993) An Introduction to the Bootstrap. Chapman and Hall, New York, London

\(\blacksquare\)

  3. From your graph, why do you think it might be a good idea to do a sign test rather than a one-sample \(t\)-test on these data? Explain briefly.

Solution

Looking at the graph you drew, the distribution of thicknesses is skewed to the right (or bimodal or multimodal or whatever you saw), and definitely not normal. The sample size is large, but it is not clear that it is large enough to overcome the severe skewness and/or bimodality and/or multimodality. (You can use some words like these to make it clear that you have considered both the distribution shape and the sample size. I am giving you the conclusion to come to; it’s up to you to come up with a reason why I could have done so, which is why I used the word “might”.)

I am not asking for a bootstrap sampling distribution of the sample mean, just for a consideration of shape and sample size. I do this just below if you want to see what it looks like.

Extra: if you are interested in that bootstrap sampling distribution:

set.seed(457299)
tibble(sim = 1:10000) %>% 
  rowwise() %>% 
  mutate(my_sample = list(sample(stamp$Thickness, replace = TRUE))) %>% 
  mutate(my_mean = mean(my_sample)) %>% 
  ggplot(aes(sample = my_mean)) + stat_qq() + stat_qq_line()

The normality here is in fact absolutely fine, so there would be no problem with a \(t\)-test for the mean. This is because the sample size is so large: the sampling distribution of the sample mean, with the large sample, “smooths out” the multiple modes as well as overcoming the skewness. But I wanted to give you a sign test to do. The mean, though, would be pulled to the right by the skewness, so you could still argue that the median is the right summary statistic and the sign test is the right test, as in the IRS example in lecture.

\(\blacksquare\)

  4. The median thickness in years prior to 1872 was 0.081 mm. Is there evidence that the paper on which stamps were printed in 1872 is thinner than in previous years? Explain briefly.

Solution

A sign test for the median. Input is dataframe, column, and null median:

sign_test(stamp, Thickness, 0.081)
$above_below
below above 
  263   207 

$p_values
  alternative    p_value
1       lower 0.00555186
2       upper 0.99575561
3   two-sided 0.01110372

We are testing the null that the (population) median thickness is 0.081 mm, against the alternative that the median thickness is less than 0.081 mm, so the P-value we need is 0.0056 from the lower line. This is smaller than 0.05, so we reject the null hypothesis in favour of the alternative. That is, we do have evidence that the paper was thinner in 1872.

Say what your P-value is, to make it clear that you are reading the right one off the output. (I might have put an alternative option in sign_test, but I didn’t, opting instead to give you all three P-values.)

Extra 1: having rejected the null, you might be interested in what the median actually is in 1872. This calls for ci_median, but if you run it the obvious way you get something that seems to make no sense:

ci_median(stamp, Thickness)
[1] 0.08 0.08

Well, the lower and upper limits of the confidence interval can’t be the same, can they?

The resolution comes from realizing three things:

  • the ends of a CI for the median are data values, or, occasionally, halfway between data values.
  • the data values are given to three decimal places (go back and look)
  • to make ci_median give more decimals, set tol to something smaller. To find out that this is what to do, look at the help for ci_median by typing ?ci_median in the console. This reveals that the accuracy to which the ends of the CI are determined is controlled by tol and the default value of tol is 0.01, which will only give us two decimals.

Hence, what we need is this:

ci_median(stamp, Thickness, tol = 1e-3)
[1] 0.07900391 0.08051318

and therefore the interval is from 0.079 to (apparently) 0.0805. (A data value, or halfway between two data values.) This is a very short interval, because we have such a large sample size. In other words, we are confident that the median is less than 0.081, but we are also very confident that the median is only a little less than 0.081.
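If you want to check that the ends of this interval really are data values or halfway between data values, you could look at the distinct thicknesses near the top end of the interval (again optional; between is a dplyr helper). If 0.0805 is halfway between two data values, you would expect to see 0.080 and 0.081 among these:

stamp %>% distinct(Thickness) %>% 
  filter(between(Thickness, 0.078, 0.082)) %>%   # distinct data values near the interval's upper end
  arrange(Thickness)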

Extra 2: if you go back and look at the test results, you see that 207 of the observed thicknesses are above 0.081 and 263 are below (of the 470 that are not exactly equal to 0.081; the others are thrown away). This might seem to you close to a 50–50 split, but in fact the P-value is telling you that if you tossed a fair coin 470 times, you would actually be rather unlikely to observe 207 heads or fewer (“more extreme”). To get a bit of intuition about whether this is reasonable, recall that if the number of trials is large (470; it is), the number of values above the hypothesized median has a binomial distribution that is well approximated by a normal distribution with the same mean and SD as the binomial:

bin_mean <- 470 * 0.5 # np
bin_mean
[1] 235
bin_sd <- sqrt(470 * 0.5 * (1 - 0.5)) # variance is np(1-p)
bin_sd
[1] 10.83974
z <- (207 + 0.5 - bin_mean) / bin_sd
z
[1] -2.536961
pnorm(z)
[1] 0.005590973

and so you see that 207 or fewer observations out of 470 above the null median is rather unlikely if the null median is correct, as you might already suspect from the z that is large in absolute value.2 The probability from this normal approximation is very close to the one-sided P-value that was calculated using the (exact) binomial distribution, indicating that the normal approximation is extremely good here.
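For comparison, the exact binomial probability that this normal curve is approximating comes straight from pbinom, and should reproduce the lower-tail P-value from sign_test above:

pbinom(207, 470, 0.5)   # exact P(207 or fewer above the null median); compare 0.00555 from sign_test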

\(\blacksquare\)

Footnotes

  1. Because the \(y\)-axis is the data axis on this kind of plot.↩︎

  2. The extra 0.5 is a continuity correction. The binomial is discrete (the number of “successes” can only be an integer), while the normal is continuous. To approximate a discrete distribution by a continuous one, you say that “less than or equal to 207” corresponds, in the continuous distribution, to “less than 207.5”, since anything below 207.5 rounds to 207 or less. In a continuous distribution, you don’t have to worry about \(<\) vs \(\le\).↩︎