Worksheet 7

Published

October 12, 2023

Questions are below. My solutions follow all the parts of each question; scroll down if you get stuck. Below the solutions there is extra discussion for some of the questions, which you might find interesting to read, maybe after tutorial.

For these worksheets, you will learn the most by spending a few minutes thinking about how you would answer each question before you look at my solution. There are no grades attached to these worksheets, so feel free to guess: it makes no difference at all how wrong your initial guess is!

1 The thickness of stamps

Collectors of postage stamps know that the same stamp may be made from several different batches of paper of different thicknesses. Our data set, in http://ritsokiguess.site/datafiles/stamp.csv, contains the thickness, in millimetres, of each of 485 stamps that were printed in 1872. It is suspected that the paper used in that year was thinner than in previous years.

  1. (1 point) Read in and display (some of) the data.

  2. (2 points) Make a suitable graph of these data. Justify your choice briefly.

  3. (2 points) From your graph, why do you think it might be a good idea to do a sign test rather than a one-sample \(t\)-test on these data? Explain briefly.

  4. (3 points) The median thickness in years prior to 1872 was 0.081 mm. Is there evidence that the paper on which stamps were printed in 1872 is thinner than in previous years? Explain briefly.

My solutions:

  1. Read in and display (some of) the data.

Solution

Very much the usual:

library(tidyverse)
my_url <- "http://ritsokiguess.site/datafiles/stamp.csv"
stamp <- read_csv(my_url)
Rows: 485 Columns: 1
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
dbl (1): Thickness

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
stamp

Note for yourself that there is one column called Thickness with an uppercase T, and that the thicknesses appear to be arranged in ascending order.
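If you would rather check those two observations than take them on trust, something like the following is a minimal sketch. It re-reads the same file so that it runs on its own; `show_col_types = FALSE` only quiets the column-specification message, and the name `sorted` and the summary column names are just illustrative choices, not part of the question.

```r
library(tidyverse)

# Same URL as in the solution above
my_url <- "http://ritsokiguess.site/datafiles/stamp.csv"
stamp <- read_csv(my_url, show_col_types = FALSE)

# Column names are case-sensitive, so look at exactly how this one is spelled
names(stamp)

# TRUE if the thicknesses never decrease from one row to the next
sorted <- !is.unsorted(stamp$Thickness)
sorted

# Smallest, middle and largest values, for a first look at the scale
stamp %>% summarize(
  n = n(),
  smallest = min(Thickness),
  med = median(Thickness),
  largest = max(Thickness)
)
```

Knowing the exact column name now saves you from an "object not found" error when you come to draw the graph.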

\(\blacksquare\)

  2. Make a suitable graph of these data. Justify your choice briefly.

Solution

The obvious choice is a histogram, with the justification that you have one quantitative variable. This is absolutely fine:

ggplot(stamp, aes(x = Thickness)) + geom_histogram(bins = 10)

There are almost 500 observations, so you can certainly justify around 10 bins, or slightly fewer. There is actually a bit of extra shape here that 10 bins helps you see (that maybe-extra-peak around 0.10 mm); the guiding principle is that if there are more details of shape that you want to convey, you’ll need more bins than you would if everything were nicely bell-shaped. I discuss that funky shape in an Extra.
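As a rough check on "around 10 bins", one common rule of thumb is Sturges' rule, which suggests about \(\log_2 n + 1\) bins, rounded up. This is only a sketch of one possible sanity check, not necessarily how the 10 above was arrived at; the name `sturges_bins` and the alternative bin counts of 6 and 20 are arbitrary choices for illustration.

```r
library(tidyverse)

# Same URL as in the solution above
my_url <- "http://ritsokiguess.site/datafiles/stamp.csv"
stamp <- read_csv(my_url, show_col_types = FALSE)

# Sturges' rule: ceiling(log2(n) + 1); with n = 485 this comes out to 10
sturges_bins <- nclass.Sturges(stamp$Thickness)
sturges_bins

# Redraw the histogram with fewer and with more bins, to see how much
# of the shape survives (6 and 20 are arbitrary illustrative values)
ggplot(stamp, aes(x = Thickness)) + geom_histogram(bins = 6)
ggplot(stamp, aes(x = Thickness)) + geom_histogram(bins = 20)
```

A rule like this only gives you a starting point: if a feature of the shape shows up with a few more bins and disappears with fewer, that is a reason to prefer the larger number.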

Your other option here, now that you’ve seen it in class, is a normal quantile plot. For this, you need to say why specifically normality is of interest to you here. You can do that by looking ahead and seeing that you will be doing a sign test, and at that point you realize the data are probably going to be non-normal, so that normality is a concern here. Or, more simply, you can say that the distribution of thicknesses being normal (or not) will tell you which test to do (a one-sample \(t\) or a sign test). This is why I asked you for a brief justification: you should be choosing a normal quantile plot because what matters to you is whether or not the data values have a normal distribution:

ggplot(stamp, aes(sample = Thickness)) + stat_qq() + stat_qq_line()