Worksheet 7

Published

October 12, 2023

Questions are below. My solutions for each question appear below all of that question's parts; scroll down if you get stuck. There is extra discussion below that for some of the questions; you might find that interesting to read, maybe after tutorial.

For these worksheets, you will learn the most by spending a few minutes thinking about how you would answer each question before you look at my solution. There are no grades attached to these worksheets, so feel free to guess: it makes no difference at all how wrong your initial guess is!

1 The thickness of stamps

Collectors of postage stamps know that the same stamp may be made from several different batches of paper of different thicknesses. Our data set, in http://ritsokiguess.site/datafiles/stamp.csv, contains the thickness, in millimetres, of each of 485 stamps that were printed in 1872. It is suspected that the paper used in that year was thinner than in previous years.

  1. (1 point) Read in and display (some of) the data.

  2. (2 points) Make a suitable graph of these data. Justify your choice briefly.

  3. (2 points) From your graph, why do you think it might be a good idea to do a sign test rather than a one-sample \(t\)-test on these data? Explain briefly.

  4. (3 points) The median thickness in years prior to 1872 was 0.081 mm. Is there evidence that the paper on which stamps were printed in 1872 is thinner than in previous years? Explain briefly.

My solutions:

Collectors of postage stamps know that the same stamp may be made from several different batches of paper of different thicknesses. Our data set, in http://ritsokiguess.site/datafiles/stamp.csv, contains the thickness, in millimetres, of each of 485 stamps that were printed in 1872. It is suspected that the paper used in that year was thinner than in previous years.

  1. Read in and display (some of) the data.

Solution

Very much the usual:

my_url <- "http://ritsokiguess.site/datafiles/stamp.csv"
stamp <- read_csv(my_url)
Rows: 485 Columns: 1
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
dbl (1): Thickness

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
stamp

Note for yourself that there is one column called Thickness with an uppercase T, and that the thicknesses appear to be arranged in ascending order.

\(\blacksquare\)

  1. Make a suitable graph of these data. Justify your choice briefly.

Solution

The obvious justification and graph is that you have one quantitative variable, so that what you need is a histogram. This is absolutely fine:

ggplot(stamp, aes(x = Thickness)) + geom_histogram(bins = 10)

There are almost 500 observations, so you can certainly justify around 10 bins, or a bit less. There is actually a bit of extra shape here that 10 bins helps you see (that maybe-extra-peak around 0.10 mm); the guiding principle is that if there are more details of shape that you want to convey, you’ll need more bins than you would if everything were nicely bell-shaped. I discuss that funky shape in an Extra.

Your other option here, now that you’ve seen it in class, is a normal quantile plot. For this, you need to say why specifically normality is of interest to you here. You can do that by looking ahead and seeing that you will be doing a sign test, and at that point you realize the data are probably going to be non-normal, so that normality is a concern here. Or, more simply, you can say that the distribution of thicknesses being normal (or not) will tell you which test to do (a one-sample \(t\) or a sign test). This is why I asked you for a brief justification: you should be choosing a normal quantile plot because what matters to you is whether or not the data values have a normal distribution:

ggplot(stamp, aes(sample = Thickness)) + stat_qq() + stat_qq_line()

This, plus something that justifies a normal quantile plot over a histogram, would also be full marks.

\(\blacksquare\)

  1. From your graph, why do you think it might be a good idea to do a sign test rather than a one-sample \(t\)-test on these data? Explain briefly.

Solution

Looking at the graph you drew, the distribution of thicknesses is skewed to the right (or bimodal or multimodal or whatever you saw), and definitely not normal. The sample size is large, but it is not clear that it is large enough to overcome the severe skewness. (You can use some words like these to make it clear that you have considered both the distribution shape and the sample size. I am giving you the conclusion to come to; it’s up to you to come up with a reason for reaching it, which is why I used the word “might”.)

I am not asking for a bootstrap sampling distribution of the sample mean, just for a consideration of shape and sample size. I do this in an Extra if you want to see what it looks like.

\(\blacksquare\)

  1. The median thickness in years prior to 1872 was 0.081 mm. Is there evidence that the paper on which stamps were printed in 1872 is thinner than in previous years? Explain briefly.

Solution

A sign test for the median. Input is dataframe, column, and null median:

sign_test(stamp, Thickness, 0.081)
$above_below
below above 
  263   207 

$p_values
  alternative    p_value
1       lower 0.00555186
2       upper 0.99575561
3   two-sided 0.01110372

We are testing the null that the (population) median thickness is 0.081 mm, against the alternative that the median thickness is less than 0.081 mm, so the P-value we need is 0.0056 from the lower line. This is smaller than 0.05, so we reject the null hypothesis in favour of the alternative. That is, we do have evidence that the paper was thinner in 1872.

Say what your P-value is, to make it clear that you are reading the right one off the output. (I might have put an alternative option in sign_test, but I didn’t, opting instead to give you all three P-values.)
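If you want to reproduce the above/below counts yourself, here is a minimal sketch (discarding any values exactly equal to the null median first, as the sign test does):

stamp %>% 
  filter(Thickness != 0.081) %>% # throw away ties with the null median
  count(Thickness < 0.081)       # TRUE = below, FALSE = above

This should reproduce the 263 below and 207 above from the sign_test output.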

\(\blacksquare\)

2 Canned tuna

Light tuna is sold in 170-gram cans. The tuna can be canned in either water or oil. Is there a difference in the typical selling price of tuna canned in water or in oil? To find out, 25 supermarkets were sampled. In 14 of them (randomly chosen), the price of a (randomly chosen) brand of tuna in water was recorded, and in the other 11 supermarkets, the price of a (randomly chosen) brand of tuna in oil was recorded. The data are in http://ritsokiguess.site/datafiles/tuna.csv, with the column canned_in saying whether the tuna was canned in oil or water, and the price being in cents.

  1. Read in and display (some of) the data.

  2. Make a graph of just the prices (that is to say, not including what each can of tuna was canned in).

  3. Make a normal quantile plot of the prices (again, ignoring what the tuna was canned in).

  4. What does your normal quantile plot tell you about the distribution of prices? Is that consistent with your first plot? (I would add “explain briefly” if this were for marks, but you should probably think about how your explanation would look in any case.)

  5. Some of the tuna was canned in oil and some in water, which may make a difference to the distribution of price. Make normal quantile plots of price for the tuna packed in oil and packed in water separately, with one ggplot command.

  6. Why might it be a bad idea to run a two-sample \(t\)-test on these data? Explain briefly.

My solutions

Light tuna is sold in 170-gram cans. The tuna can be canned in either water or oil. Is there a difference in the typical selling price of tuna canned in water or in oil? To find out, 25 supermarkets were sampled. In 14 of them (randomly chosen), the price of a (randomly chosen) brand of tuna in water was recorded, and in the other 11 supermarkets, the price of a (randomly chosen) brand of tuna in oil was recorded. The data are in http://ritsokiguess.site/datafiles/tuna.csv, with the column canned_in saying whether the tuna was canned in oil or water, and the price being in cents.

  1. Read in and display (some of) the data.

Solution

Making it easy for you:

my_url <- "http://ritsokiguess.site/datafiles/tuna.csv"
tuna <- read_csv(my_url)
Rows: 25 Columns: 2
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (1): canned_in
dbl (1): price

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
tuna

Scrolling down reveals that the cans of tuna canned in oil are at the bottom.

\(\blacksquare\)

  1. Make a graph of just the prices (that is to say, not including what each can of tuna was canned in).

Solution

A histogram, with one quantitative variable:

ggplot(tuna, aes(x = price)) + geom_histogram(bins = 5)

Not too many bins, with 25 observations altogether. My guess is that the small extra peak around 200 cents is just chance (possibly unlike the question about the stamps).

An option here is a one-sample boxplot (which doesn’t require any thought about bins):

ggplot(tuna, aes(y = price)) + geom_boxplot()

These are both very much skewed to the right, as you would do well to observe (on the boxplot, because of the longer upper whisker). The boxplot suggests an outlier, which didn’t show up on the histogram, but which might have done if we had used more bins. Also note that there is no indication on the boxplot that there might be an extra peak, because an assumption hiding behind the boxplot is that there is “a” centre (the median) and some deviation about that, possibly including outliers (but not a second mode).
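If you are curious whether the outlier would show up with more bins, a quick sketch (10 bins instead of 5):

ggplot(tuna, aes(x = price)) + geom_histogram(bins = 10)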

\(\blacksquare\)

  1. Make a normal quantile plot of the prices (again, ignoring what the tuna was canned in).

Solution

This, remembering that in a normal quantile plot, the thing that you are assessing for normality is called sample:

ggplot(tuna, aes(sample = price)) + stat_qq() + stat_qq_line()

\(\blacksquare\)

  1. What does your normal quantile plot tell you about the distribution of prices? Is that consistent with your first plot? (I would add “explain briefly” if this were for marks, but you should probably think about how your explanation would look in any case.)

Solution

Your histogram (or boxplot) indicated a right-skewed distribution, so you would expect your normal quantile plot to show the same thing. This one certainly does show an upward-opening curve,¹ indicating skewness in one direction or the other. To work out which way, look at the lower and upper tails: the lowest values are all very close together, being just over 50 cents, but the highest values are spread out, with the highest one being over $2. This combination of a short left tail and a long right tail means skewed to the right. (The direction of skewness is where the long tail is.)

So the histogram and the normal quantile plot are showing the same thing.

\(\blacksquare\)

  1. Some of the tuna was canned in oil and some in water, which may make a difference to the distribution of price. Make normal quantile plots of price for the tuna packed in oil and packed in water separately, with one ggplot command.

Solution

The key to this is to facet by the grouping variable:

ggplot(tuna, aes(sample = price)) + stat_qq() + stat_qq_line() +
  facet_wrap(~canned_in)

The story is that both groups have a right-skewed distribution of prices. For the tuna packed in water (on the right), this is almost a classic right skew (with a short tail on the left and a long tail on the right). For the tuna packed in oil, things are less clear: it is not obvious that the right tail is really too long, but the left tail is definitely too short. In fact, it looks as if the lowest six prices come from an entirely different distribution than the other five.

\(\blacksquare\)

  1. Why might it be a bad idea to run a two-sample \(t\)-test on these data? Explain briefly.

Solution

For a two-sample \(t\)-test, we assume that both sets of price values are drawn from a normal distribution. These ones are very definitely not, since both distributions are skewed to the right.

The second thing to consider is the sample sizes. These are 14 and 11, not large, and with this amount of skewness the Central Limit Theorem will not help enough (at least, that’s my guess). You could get hold of the bootstrap sampling distributions of the sample means for each group if you wanted to investigate this further.
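If you did want to investigate, here is a sketch along the lines of the bootstrap pipeline in the stamps Extra later on, shown for the tuna canned in oil (this assumes the values in canned_in are oil and water; repeat with "water" for the other group):

# bootstrap sampling distribution of the sample mean price, oil group only
oil_prices <- tuna %>% filter(canned_in == "oil")
set.seed(457299)
tibble(sim = 1:10000) %>% 
  rowwise() %>% 
  mutate(my_sample = list(sample(oil_prices$price, replace = TRUE))) %>% 
  mutate(my_mean = mean(my_sample)) %>% 
  ggplot(aes(sample = my_mean)) + stat_qq() + stat_qq_line()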

So, considering these two things, a two-sample \(t\)-test is unwise.

\(\blacksquare\)

3 The thickness of stamps: extras

(b)

Extra: in my original source for these data, the interest was actually not in the mean or median thickness, but in something else: the stamps issued in that year were thought to be a “philatelic mixture”, that is, printed on more than one type of paper. You would expect different types of paper to have different thicknesses (especially in 1872, when manufacturing processes were a good deal less uniform than they are now).

If you go back to my histogram above, you see that the histogram is skewed to the right (which we take up shortly), but it also looks as if there is a second mode (peak) just above 0.10 mm. This would support the idea that these stamps were made from two different types of paper: one with average thickness around 0.08 mm, and one with average thickness just above 0.10 mm. (The relatively small number of stamps of thickness around 0.09 mm supports this: stamps of thickness 0.09 mm could be unusually thick ones from the 0.08 mm paper, or unusually thin ones from the 0.10 mm paper.)

The normal quantile plot doesn’t tell us anything about multimodality, just that the distribution is not normal. It does that very well: it says that the upper tail is just fine for normality, but the lower tail is way too short. You might find that surprising looking at the histogram, where the problem appears to be the long right tail. But that’s all the normal quantile plot does.

Going back to the histogram, you might be wondering whether there are more modes that this histogram hides. This is one of the rare cases where looking at a lot more bins is helpful, in this case 30:

ggplot(stamp, aes(x = Thickness)) + geom_histogram(bins = 30)

Looking at that, you could make the case for as many as 7 modes, though how many of them are real and how many are just chance is another question. In Efron and Tibshirani’s book on the bootstrap (reference below), they use a bootstrap-based method to make the case for 7 modes (and thus 7 different types of paper). Our library has this book, or you may be able to find it in your favourite quasi-legal source. The example starts on page 227.

Reference:

Efron, B. and Tibshirani, R. J. (1993). An Introduction to the Bootstrap. Chapman and Hall, New York.

(c)

Extra: if you are interested in that bootstrap sampling distribution:

set.seed(457299)
tibble(sim = 1:10000) %>% 
  rowwise() %>% 
  mutate(my_sample = list(sample(stamp$Thickness, replace = TRUE))) %>% 
  mutate(my_mean = mean(my_sample)) %>% 
  ggplot(aes(sample = my_mean)) + stat_qq() + stat_qq_line()

The normality here is in fact absolutely fine, so there would be no problem with a \(t\)-test for the mean. This is because the sample size is so large. But I wanted to give you a sign test to do. The mean, though, would be pulled to the right by the skewness, so you could still argue that the median is the right summary statistic and the sign test is the right test.

Note: when I am ending these pipelines with a normal quantile plot, I like to do 10,000 simulations rather than 1000. This is because the tails on this plot can vary more with 1000 simulations, and you might get led astray, being convinced that the distribution is not normal when in fact it is.
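And if you did want to run that \(t\)-test for the mean, it is one line of base R (a sketch, with the one-sided alternative matching the suspicion that the paper was thinner):

# one-sample t-test of the mean thickness against 0.081 mm, one-sided
t.test(stamp$Thickness, mu = 0.081, alternative = "less")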

(d)

Extra 1: having rejected the null, you might be interested in what the median actually is in 1872. This calls for ci_median, but if you run it the obvious way you get something that seems to make no sense:

ci_median(stamp, Thickness)
[1] 0.08 0.08

Well, the lower and upper limits of the confidence interval can’t be the same, can they?

The resolution comes from realizing three things:

  • the ends of a CI for the median are data values, or, occasionally, halfway between data values;
  • the data values are given to three decimal places (go back and look);
  • to make ci_median give more decimals, set tol to something smaller. To find out that this is what to do, look at the help for ci_median by typing ?ci_median in the console. This reveals that the accuracy to which the ends of the CI are determined is controlled by tol, and that the default value of tol is 0.01, which will only give us two decimals.

Hence, what we need is this:

ci_median(stamp, Thickness, tol = 1e-3)
[1] 0.07900391 0.08051318

and therefore the interval is from 0.079 to (apparently) 0.0805. (A data value, or halfway between two data values.) This is a very short interval, because we have such a large sample size. In other words, we are confident that the median is less than 0.081, but we are also very confident that the median is only a little less than 0.081.

Extra 2: if you go back and look at the test results, you see that 207 of the observed thicknesses are above 0.081 and 263 are below (of the 470 that are not exactly equal to 0.081; the others are thrown away). This might seem to you close to a 50–50 split, but in fact the P-value is telling you that if you tossed a fair coin 470 times, you would actually be rather unlikely to observe 207 heads or fewer (“more extreme”). To get a bit of intuition about whether this is reasonable, recall that if the number of trials is large (470; it is), the number of values above the hypothesized median has a binomial distribution that is well approximated by a normal distribution with the same mean and SD as the binomial:

bin_mean <- 470 * 0.5 # np
bin_mean
[1] 235
bin_sd <- sqrt(470 * 0.5 * (1 - 0.5)) # variance is np(1-p)
bin_sd
[1] 10.83974
z <- (207 + 0.5 - bin_mean) / bin_sd
z
[1] -2.536961
pnorm(z)
[1] 0.005590973

and so you see that 207 or fewer observations out of 470 above the null median is rather unlikely if the null median is correct, as you might already suspect from the z that is large in absolute value.² The probability from this normal approximation is very close to the one-sided P-value that was calculated using the (exact) binomial distribution, indicating that the normal approximation is extremely good here.
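The exact binomial probability that the sign test is using is itself a base-R one-liner, if you want to check:

pbinom(207, 470, 0.5) # exact P(X <= 207) for X ~ binomial(470, 0.5)

This should reproduce the lower P-value from the sign test output, 0.0056.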

4 Canned tuna: Extras

There are rather a lot of these this time:

(a):

Extra: you might be wondering what happens if we try read_delim with no second input here:

my_url <- "http://ritsokiguess.site/datafiles/tuna.csv"
tuna <- read_delim(my_url)
Rows: 25 Columns: 2
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (1): canned_in
dbl (1): price

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
tuna

Exactly the same. If you look at the column specification (click the R Console output above if you don’t see it):

Rows: 25 Columns: 2
── Column specification ────────────────────────────────────────────────────────────────────────────────
Delimiter: ","
chr (1): canned_in
dbl (1): price

you see Delimiter: ",", meaning that each data value is separated by a comma, which is exactly what CSV, Comma Separated Values, stands for.
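If you prefer to be explicit about it, read_delim also lets you supply the delimiter yourself; this sketch should give the same result again:

tuna <- read_delim(my_url, delim = ",")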

(b):

Extra: we are going to be interested in normality in a moment, so you might (reasonably) argue that we should go straight to normal quantile plots. The best way to do those in a two-sample situation like this is facetted:

ggplot(tuna, aes(sample = price)) + stat_qq() + stat_qq_line() +
  facet_wrap(~canned_in)

These are the same plots you draw later.

(c):

Extra: transforming the data first might help, like the log transformation you used on the house prices:

ggplot(tuna, aes(x = canned_in, y = log(price))) + geom_boxplot()

Taking logs is not enough, because these distributions are still very much skewed to the right. So we seem to be stuck with something other than a two-sample \(t\)-test.³
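If you want to see the footnote’s point for yourself, here is a quick sketch with the reciprocal (which, per the footnote, goes too far the other way):

ggplot(tuna, aes(x = canned_in, y = 1/price)) + geom_boxplot()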

(f):

Looking ahead a bit:

The other thing that might have been relevant (but turned out not to be) was that the prices of tuna canned in oil were much more spread out than those canned in water. This shows up on the normal quantile plots as the line for oil being steeper than the one for water. (The two facetted plots are on the same scale, so the one whose points fill more of its facet, and whose line is steeper, is more variable.) This would also show up in the usual way on a boxplot, if you drew one.

This turns out not to be relevant because once we have rejected normality, as we have, it doesn’t matter whether one of the groups is more spread out than the other; the test we’ll see, the Mood median test, will cope perfectly well. You might have run into another nonparametric test for this situation called the rank sum test, or the Mann-Whitney test, or the Wilcoxon test (these are all the same test). For the rank sum test, it matters that both groups have the same spread; the usual way it is phrased is that the two groups have to have the same distribution apart possibly from a difference in medians. I don’t like the rank sum test for this reason; if we can handle a difference in spreads in the \(t\)-test (by using the Welch test), we ought to be able to handle a difference in spreads in the test we use that does not assume normality. I don’t know of a Welch-like version of the rank sum test.
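For the record, the rank sum test is also a base-R one-liner, should you want to see it (not that I recommend it here, given the difference in spreads):

# two-sample rank sum (Mann-Whitney / Wilcoxon) test
wilcox.test(price ~ canned_in, data = tuna)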

Footnotes

  1. Albeit a curve with a rather sharp corner in this case.↩︎

  2. The extra 0.5 is a continuity correction. The binomial is discrete (the number of “successes” can only be an integer), while the normal is continuous. To approximate a discrete distribution by a continuous one, you have to say that “less than or equal to 207” means, in a continuous distribution, “less than anything that rounds to 207”, i.e. \(< 207.5\). In a continuous distribution, you don’t have to worry about \(<\) vs \(\le\).↩︎

  3. There might be a different transformation that would work. But things like 1/price go too far the other way.↩︎