STAC32 Assignment 3

You are expected to complete this assignment on your own: that is, you may discuss general ideas with others, but the writeup of the work must be entirely your own. If your assignment is unreasonably similar to that of another student, you can expect to be asked to explain yourself.

If you run into problems on this assignment, it is up to you to figure out what to do. The only exception is if it is impossible for you to complete this assignment, for example a data file cannot be read. (There is a difference between you not knowing how to do something, which you have to figure out, and something being impossible, which you are allowed to contact me about.)

You must hand in a rendered document that shows your code, the output that the code produces, and your answers to the questions. This should be a file with .html on the end of its name. There is no credit for handing in your unrendered document (ending in .qmd), because the grader cannot then see whether the code in it runs properly. After you have handed in your file, you should be able to see (in Attempts) what file you handed in, and you should make a habit of checking that you did indeed hand in what you intended to, and that it displays as you expect.

Hint: render your document frequently, and solve any problems as they come up, rather than trying to do so at the end (when you may be close to the due date). If your document will not successfully render, it is because of an error in your code that you will have to find and fix. The error message will tell you where the problem is, but it is up to you to sort out what the problem is.

Copper in Wholemeal Flour

24 observations were made of the amount of copper in wholemeal flour, in parts per million. The data are in http://ritsokiguess.site/datafiles/flour.csv.

  1. (1 point) Read in and display (some of) the data.

As ever:

library(tidyverse)
my_url <- "http://ritsokiguess.site/datafiles/flour.csv"
flour <- read_csv(my_url)
Rows: 24 Columns: 1
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
dbl (1): copper

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
flour

flour is a good name for the dataframe (it is copper in flour). Calling the dataframe copper is likely to confuse you (and probably your reader), but if you want to call it something like copper_content that is fine.

  2. (2 points) Make a suitable graph of the copper determinations, one that shows the whole distribution reasonably well.

This is most obviously a histogram, but you will undoubtedly need to play with the number of bins. This is probably about where you started:

ggplot(flour, aes(x = copper)) + geom_histogram(bins = 6)

This shows the outlier and two bins for all the other observations, so it does not give a decent picture of the shape of the whole distribution. You need to use more bins than you normally would to get any kind of picture of the distribution as a whole.1

You can even go nuts with the number of bins:

ggplot(flour, aes(x = copper)) + geom_histogram(bins = 50)

Because of how distant the outlier is, a lot of the bins will be completely empty; to show anything of the shape of the distribution on the left, you need enough bins that the non-empty ones on the left are suitably narrow. To my mind, this is actually a very sensible number of bins for this dataset. There may actually be another outlier, near 5, that shows up when you use more bins.

Aside: ggplot will also let you specify a binwidth instead of a number of bins. I mention this because, to get a decent picture of the bulk of the distribution on the left, you want about 5 bins to cover the values from 0 to 5, so the bin width ought to be about 1:

ggplot(flour, aes(x = copper)) + geom_histogram(binwidth = 1)

Or you can use a binwidth of less than that, to get an effect more like the 50-bin plot that shows the maybe-outlier at 5.
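For instance, a sketch (the binwidth of 0.5 is just my illustrative choice; anything noticeably less than 1 gives a similar effect):

ggplot(flour, aes(x = copper)) + geom_histogram(binwidth = 0.5)

End of aside.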

Alternatively, you can do a one-sample boxplot:

ggplot(flour, aes(x = 1, y = copper)) + geom_boxplot()

There actually is, according to the boxplot, a second outlier value, close to 5.

The one-sample boxplot doesn’t really show the distribution of the majority of values, other than to say that most of them are very close together between maybe 2 and 3.

  3. (3 points) Obtain and plot a bootstrap sampling distribution of the sample mean for these data.

We would expect this not to look normal at all, but maybe the sample size of 24 is helpful. With that in mind, I am using slightly more bins in my histogram than I normally would:

tibble(sim = 1:1000) %>% 
  rowwise() %>% 
  mutate(my_sample = list(sample(flour$copper, replace = TRUE))) %>% 
  mutate(my_mean = mean(my_sample)) %>% 
  ggplot(aes(x = my_mean)) + geom_histogram(bins = 12)

That looks more like a hedgehog than a normal distribution!

  4. (2 points) What do you learn from your plot of the previous part?

The histogram of the bootstrapped sampling distribution of the sample mean is also not close to a normal distribution, and hence \(t\) procedures (test or confidence interval) would not be appropriate for drawing conclusions about the population mean. The sample size of 24 is not by any means large enough to overcome the considerable non-normality in the original data.

Extra: The reason for the odd shape is that the bootstrapped sample mean is heavily dependent on that one very large value; in particular, the mean of a bootstrap sample will depend mostly on how many times that one very large value gets resampled: the taller bars correspond, from left to right, to it appearing zero, one, two, three, and four times.

It is possible to resample that very large value several times, but, as you see, this becomes progressively less likely.2
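To put rough numbers on that (my own aside): each of the 24 draws in a bootstrap resample picks the very large value with probability 1/24, so the number of times it appears in a resample is binomial with 24 trials and success probability 1/24:

dbinom(0:4, size = 24, prob = 1/24)

These probabilities come out around 0.36, 0.38, 0.19, 0.06, and 0.01, so three or more appearances are already rare, which is why the bars towards the right of the histogram are so short.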

To confirm that, I can tweak the simulation to count the number of times that large value was observed as well as the mean:

tibble(sim = 1:1000) %>% 
  rowwise() %>% 
  mutate(my_sample = list(sample(flour$copper, replace = TRUE))) %>% 
  mutate(my_mean = mean(my_sample)) %>% 
  mutate(big_count = sum(my_sample > 20)) %>% 
  ggplot(aes(x = big_count, y = my_mean)) + geom_point()

and then you can see just how much the sample mean was dependent on the number of times that very large value was observed.3 For example, if the mean of a bootstrap sample was 6.5, the high outlier must have appeared in that sample three times.

As a technical detail, my_sample > 20 is either TRUE or FALSE for each observation. If you add these up, TRUE counts as 1 and FALSE as zero, so the sum of my_sample > 20 is the number of TRUE values in the sample: that is, the number of sample values that were bigger than 20, and therefore the number of times that large value (the only one bigger than 20) appeared in the sample.
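As a tiny illustration of that, with made-up numbers:

sum(c(1, 25, 3, 30) > 20)

The comparison gives FALSE, TRUE, FALSE, TRUE, so the sum is 2: two values bigger than 20.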

I am guessing that you would need a much larger sample, in the hundreds or even bigger, before a \(t\)-test would be in any way advisable here. Of course, you could very reasonably take the attitude that the mean is entirely the wrong summary statistic to be using, and we should be working with the median instead. The median will not be affected by the one or two outliers at all (or barely at all). Later, we learn a test called the sign test that can be used to draw conclusions about the population median, when (as here) the population median is the right thing to care about.
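If you want a preview of the median idea, a minimal sketch (my own addition, reusing the bootstrap idiom from above) bootstraps the sample median instead of the mean:

tibble(sim = 1:1000) %>% 
  rowwise() %>% 
  mutate(my_sample = list(sample(flour$copper, replace = TRUE))) %>% 
  mutate(my_median = median(my_sample)) %>% 
  ggplot(aes(x = my_median)) + geom_histogram(bins = 12)

The bootstrapped medians should cluster tightly near the middle of the data: the median of 24 values is the average of the 12th and 13th smallest, which the outliers barely affect.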


Sunspots

Sunspots are, according to Wikipedia:

temporary spots on the Sun’s surface that are darker than the surrounding area. They are regions of reduced surface temperature caused by concentrations of magnetic flux that inhibit convection. … Their number varies according to the approximately 11-year solar cycle.

Astronomers have long been interested in counting sunspots. Records of the number of sunspots every year are in http://ritsokiguess.site/datafiles/sunspot_count.csv. (These appear to be the mean number of sunspots per month in that year, and hence may have decimals.) The file contains: the number of sunspots, the year that the number of sunspots applies to, and an indication of whether the year is before 1900 or after. The years recorded are from 1700 to 1988.

  1. (1 point) Read in and display (some of) the data.

As ever:

my_url <- "http://ritsokiguess.site/datafiles/sunspot_count.csv"
sunspot_count <- read_csv(my_url)
Rows: 289 Columns: 3
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (1): since1900
dbl (2): sunspots, year

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
sunspot_count

The columns are as promised, and there are 289 rows because there are 289 years from 1700 to 1988 inclusive (a quick check of that arithmetic is below). I called my dataframe something other than sunspots because that is the name of one of the columns. If you want a shorter name, sun or spots would do, to indicate what the dataset is about.
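(The arithmetic check, remembering that both endpoint years count:)

1988 - 1700 + 1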

If you scroll down to about year 1750, you’ll find the first values that have decimals.

  2. (2 points) Some astronomers are interested in whether the average number of sunspots has increased since 1900. Draw a suitable graph that will offer insight into this question.

The issue here is not the specific years, but whether each year is before or after 1900. Hence the graph should feature the since1900 column and the sunspots column, but not the year column.

One categorical column and one quantitative one suggests a boxplot:

ggplot(sunspot_count, aes(x = since1900, y = sunspots)) + geom_boxplot()

That’s all I needed, but there are a couple of things you might observe:

  • The median number of sunspots is slightly higher after 1900 than it was before.
  • There appears to be more variability in the post-1900 observations as well.
  • Both distributions are skewed to the right, with one of the before-1900 observations being a not-very-extreme high outlier.

Thinking a bit further ahead, there were 289 years altogether, so we have some large sample sizes here:

sunspot_count %>% 
  count(since1900)

Extra: after appears (illogically) on the left of my boxplot because it comes first alphabetically. You might note that the data are in year order, so the order in which the values of since1900 appear in the data would be more sensible for the plot, hence:

ggplot(sunspot_count, aes(x = fct_inorder(since1900), y = sunspots)) + geom_boxplot()

at the expense of an odd \(x\)-axis label, after now does indeed come after before, reading from left to right.

  3. (3 points) Run the most appropriate \(t\)-test, given what the astronomers (mentioned in the previous part) are interested in, justifying your choice of test briefly.

It is a given that we are comparing mean sunspot counts, so we need some kind of two-sample \(t\)-test. The choices are Welch and pooled, and you need to say why you are doing the one you are doing. From the boxplot, it looks as if the two groups of measurements have different spreads (the after boxplot is taller), and so we need the Welch test and not the pooled test. (One point.)

The astronomers are interested in whether there have been more sunspots since 1900, so that the mean for after should be larger than the mean for before. This means that we should do a one-sided test with an alternative of greater:

t.test(sunspots ~ since1900, data = sunspot_count, alternative = "greater")

    Welch Two Sample t-test

data:  sunspots by since1900
t = 2.6198, df = 132.4, p-value = 0.004912
alternative hypothesis: true difference in means between group after and group before is greater than 0
95 percent confidence interval:
 5.360832      Inf
sample estimates:
 mean in group after mean in group before 
            58.70225             44.12400 

(This may seem backwards to you, but when figuring out the alternative you need to decide which category R treats as first, which may not be the one you think of as first. In this case, after is alphabetically ahead of before, even though logically before comes first, and the alternative expresses how the category R treats as first compares to the second one, if the alternative hypothesis is true.)
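If you are not sure which category R will treat as first, one way to check (my suggestion; any way of displaying the sorted distinct values will do) is:

sunspot_count %>% distinct(since1900) %>% arrange(since1900)

Whatever comes out on top, here after, is the first group in the t.test output.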

The other two points are for getting all of that code right, and getting this output.

  4. (2 points) What do you conclude from your test, in the context of the data? Write your answer in a way that your reader will be convinced by your logic.

  • The null hypothesis is that the mean numbers of sunspots before and after 1900 are the same.
  • The alternative is that the mean number of sunspots is greater after 1900 than it is before.
  • The P-value is 0.0049, which is less than 0.05 (our \(\alpha\)), so we reject the null hypothesis in favour of the alternative.
  • Thus we conclude that the mean number of sunspots is greater after 1900 than it was before.

Half a point for each of those, either for stating them or for convincing the grader that you know what they are. Convincing the reader of your logic means making it clear that you know what the competing hypotheses are, how you decide between them, and what your decision means, so that your reader can act on your conclusion. (In this case, the astronomers looking at this dataset might have a theoretical reason for supposing that there are more sunspots now than there used to be, and this test will support that reasoning.)

If you did a two-sided test, you will have lost points in the previous part, but if you correctly draw a two-sided conclusion from the P-value of 0.0098 (“the mean number of sunspots is different after 1900 than it was before”) you get two points here.

Extra: This P-value is a lot smaller than you may have been expecting. The difference (between the medians) on the boxplot didn’t look very convincing, but there is no doubt that this is a significant difference. The reason is probably the large sample sizes; it didn’t take much of a difference to be significant. We can investigate this by finding a confidence interval, which means running t.test again because a confidence interval is two-sided and our test wasn’t:

t.test(sunspots ~ since1900, data = sunspot_count)

    Welch Two Sample t-test

data:  sunspots by since1900
t = 2.6198, df = 132.4, p-value = 0.009825
alternative hypothesis: true difference in means between group after and group before is not equal to 0
95 percent confidence interval:
  3.571265 25.585229
sample estimates:
 mean in group after mean in group before 
            58.70225             44.12400 

The mean number of sunspots after 1900 is estimated to be between about 3.6 and 25.6 bigger than it was before 1900. This confidence interval does not include zero ("it really is bigger after than it was before"), but the interval is wide, so it does not tell us very much about how much bigger than zero the difference in means actually is. (This commonly happens unless you have really big sample sizes or very little variability.)

Note that getting a confidence interval is not a good way of doing a test, for two reasons (here):

  • the test we wanted to do is one-sided, and confidence intervals are two-sided.
  • all the confidence interval tells us is that the P-value for a two-sided test is less than 0.05, because it is a 95% CI. (Therefore, with a bit of work involving being on the “correct side”, the one-sided P-value is less than 0.025.) But it doesn’t say how much less, and this matters if your reader wants to use a different \(\alpha\) than the one you chose. What if your reader’s \(\alpha\) was 0.01? Based only on the CI, they would not know whether to reject the null or not. But by giving the accurate P-value, you enable your reader to make their own decision. In this case, by seeing the P-value, they would know that they could also reject the null at \(\alpha = 0.01\).
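Here, the sample difference is on the "correct side" (the after mean is the bigger one), so the one-sided P-value is exactly half the two-sided one, which you can verify from the two outputs above:

0.009825 / 2

This is 0.0049125, agreeing (up to rounding) with the 0.004912 from the one-sided test we did.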
  5. (2 points) Why do you think the astronomers were happy to run a \(t\)-test, despite the evidence in the graph? Explain briefly.

The relevant evidence in the boxplot is that both distributions of sunspot counts are skewed to the right (not especially close to normal). The official assumption of the two-sample \(t\)-tests is that both distributions are normal, but the Central Limit Theorem says that normality is less important if the sample sizes are large. I found out before that the sample sizes are 89 and 200, which are both large, and I would expect them to be large enough to overcome the skewness that we see in the boxplot.

Note that what matters here is the individual sample sizes, not the total sample size of 289; if the smaller sample had been much smaller, like 20, we would have needed that sample to have been closer to normal as compensation for the smaller sample size, even though the other sample was very large. In other words, we need both samples to be large enough, given the amount of non-normality in them. So the best answer says that both sample sizes are large (and ideally how big they are), not just that there are a lot of observations altogether.

Extra: as ever, we can look at bootstrap sampling distributions to assess whether the sample size really is big enough. The two distributions are about equally skewed, so the one to worry about is the smaller sample, which is "after"; grab those observations first:

sunspot_count %>% 
  filter(since1900 == "after") -> afters
afters

and then work out the bootstrap sampling distribution and assess it for normality:

tibble(sim = 1:10000) %>% 
  rowwise() %>% 
  mutate(my_sample = list(sample(afters$sunspots, replace = TRUE))) %>% 
  mutate(my_mean = mean(my_sample)) %>% 
  ggplot(aes(sample = my_mean)) + stat_qq() + stat_qq_line()

That is hard to argue with, just the tiniest bit right-skewed.

The other thing a two-sample \(t\)-test has going for it is that both samples are skewed the same way, and so the difference in means will be less skewed than either sample individually, and therefore the picture is actually even better than the normal quantile plot we just made.
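To check that last claim (my own extension of the code above; I make a befores dataframe to go with afters), bootstrap each sample separately and look at the distribution of the difference in sample means:

sunspot_count %>% 
  filter(since1900 == "before") -> befores
tibble(sim = 1:10000) %>% 
  rowwise() %>% 
  mutate(sample_after = list(sample(afters$sunspots, replace = TRUE))) %>% 
  mutate(sample_before = list(sample(befores$sunspots, replace = TRUE))) %>% 
  mutate(my_diff = mean(sample_after) - mean(sample_before)) %>% 
  ggplot(aes(sample = my_diff)) + stat_qq() + stat_qq_line()

If the argument above is right, these points should hug the line even more closely than the ones for the after means alone.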

  6. (3 points) One of the other assumptions of the \(t\)-test is that the observations within each group are independent of each other. By making a time plot of the sunspot counts (joining the points by lines), assess this assumption. Hint: geom_line.

This means to plot sunspot count (quantitative) against year (also quantitative), which means something like a scatterplot:

ggplot(sunspot_count, aes(x = year, y = sunspots)) + geom_point() 

It’s hard to make much sense of this (unless you have a better eye than I do), so when the explanatory variable is time (thinking of time as “causing” the number of sunspots), it’s common to join the neighbouring points by a line, which is done with geom_line:

ggplot(sunspot_count, aes(x = year, y = sunspots)) + geom_point() + geom_line()

Two points for a graph like this.

With this graph, it is much easier to see what is happening: the number of sunspots goes gradually up to a maximum and then gradually comes down again. To say it differently, if you know that there were a lot of sunspots one year, you know that there are likely a lot of sunspots the next year, which means that the numbers of sunspots in different years are not independent as required by the \(t\)-test.

Another hint was given in the question. Wikipedia says “Their number varies according to the approximately 11-year solar cycle.” You might also have noticed that the number of sunspots varies in regular “waves” up and down, and the high points are indeed about 11 years apart. This gives another way in which independence fails: if you know how many sunspots there were 11 years ago, you can make a pretty good guess at how many there are now (you would expect to be at the same point on the cycle, low if low and high if high).

Say something about how one value is predictable from another one (and how) for the third point.
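One way to quantify that predictability (my own aside; it relies on the rows being in year order, which they are) is to correlate each year's sunspot count with the count one year back and the count eleven years back:

sunspot_count %>% 
  mutate(last_year = lag(sunspots),
         cycle_ago = lag(sunspots, 11)) %>% 
  summarize(r_lag1 = cor(sunspots, last_year, use = "complete.obs"),
            r_lag11 = cor(sunspots, cycle_ago, use = "complete.obs"))

Independent observations would give correlations near zero; correlations clearly away from zero say that knowing one year's count really does help you predict another's.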

Extra: This last graph gives a rather more nuanced view of how the number of sunspots varies over time than our “brute-force” splitting of the dataset into two pieces, “before” and “after”, then lumping all the observations in each category together. It does look as if the more recent waves do indeed go up higher than the earlier ones, but a better test for this would model the waves, perhaps with a sine curve with period around 11 years,4 and then seeing whether the amplitude of those sine curves (the top-to-bottom height) is increasing over time. A test like this might be a lot more effective than our \(t\)-test, because we are actually modelling the entire process, and there may not be much variability left over once we have modelled the waves. (This is a theme: the better your model for the process you are investigating, the better your tests will be for the effects you care about.)
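A minimal sketch of that modelling idea (mine, not part of the question; it fixes the period at exactly 11 years, which is itself only an approximation) uses sine and cosine terms in a regression, letting the fit choose the phase of the wave:

sine_fit <- lm(sunspots ~ year + sin(2 * pi * year / 11) + cos(2 * pi * year / 11),
               data = sunspot_count)
summary(sine_fit)

Testing whether the amplitude of the wave grows over time would need more than this, for example interactions of year with the sine and cosine terms, so treat this as a starting point only.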

Footnotes

  1. I got 6 bins from Sturges' rule, \(\lceil \log_2 n \rceil + 1\): the next power of 2 at or above the sample size of 24 is \(32 = 2^5\), which says \(5 + 1 = 6\) bins. But remember that Sturges' rule applies only to bell-curve-shaped distributions, which this one most definitely is not; otherwise you need to use more bins.↩︎

  2. This is such a clear pattern that it would be expected to show up almost no matter what number of bins you choose.↩︎

  3. According to what I was able to find out about this dataset, the very large value was apparently checked and found to be correct, even though it seems to be about ten times too large (and misplacing a decimal point is the kind of thing that happens with real-world data).↩︎

  4. This, if I recall, is grade 10 or 11 math. I remember my daughter did it not too long ago.↩︎