STAC32 Assignment 6

Wind in Ireland

Daily average wind speeds for 1961-1978 were recorded at 12 weather stations in the Republic of Ireland. Wind speeds are in knots (1 knot = 0.5144 m/s). The data are in http://ritsokiguess.site/datafiles/wind.csv.

Columns are: the year, month, and day of each observation (actual year is year plus 1900), and the daily average wind speed on each day for each of 12 weather stations named in the columns (with three-letter abbreviations). We will be concerned with VAL, Valentia, in the southwest of Ireland, and DUB, Dublin, in the east.

  1. (1 point) Read in and display (some of) the data.

As ever:

my_url <- "http://ritsokiguess.site/datafiles/wind.csv"
wind <- read_csv(my_url)
Rows: 6574 Columns: 15
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
dbl (15): year, month, day, RPT, VAL, ROS, KIL, SHA, BIR, DUB, CLA, MUL, CLO...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
wind

The name wind does not appear in the dataframe anywhere, so you can safely use that as the name of the whole dataframe.

There are a lot of rows, but you should expect there to be: these are daily data from 18 years, so there should be about this many observations:

18 * 365
[1] 6570

plus a few more for leap years, and so there are.
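The leap-year count can be checked directly; a base-R sketch (no packages needed, and none of the years here is a century year, so the divisible-by-4 rule is enough):

```r
# 1961-1978 contains four leap years: 1964, 1968, 1972, 1976
leap_years <- sum(1961:1978 %% 4 == 0)
leap_years              # 4
18 * 365 + leap_years   # 6574, exactly the number of rows we got
```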

  1. (2 points) Why is this matched pairs data? Explain briefly. (You may wish to note that Ireland is a fairly small country.)

Each day produces two observations, one for Valentia and one for Dublin, and observations on the same day are likely to be correlated.

Specifically, if it is windy in one part of Ireland, it is likely to be windy in the rest of Ireland on that same day as well, as shown in Figure 1.

ggplot(wind, aes(x = VAL, y = DUB)) + geom_point()

with(wind, cor(VAL, DUB))
[1] 0.6670544
Figure 1: Average daily wind speeds in Valentia and Dublin, and correlation between them
  1. (2 points) Draw a plot that can be used to assess the appropriateness of a matched-pairs \(t\)-test.

The key thing is to look at the differences between Valentia and Dublin on the same day. Since we are interested in the appropriateness of a \(t\)-test, normality is an issue, and so a normal quantile plot is best. I am going to calculate a column of differences first, and save the result:

wind %>% 
  mutate(difference = VAL - DUB) -> wind

and then

ggplot(wind, aes(sample = difference)) + stat_qq() + stat_qq_line()

The second-best graph is a histogram of the differences, because a histogram does not assess normality directly, and normality is what we care about here:

ggplot(wind, aes(x = difference)) + geom_histogram(bins = 15)

  1. (2 points) The researchers decided to run a matched-pairs \(t\)-test. Why do you think they chose to do this? (There are two things to say, one of them related to the plot you just drew.)

The usual two things relevant to the appropriateness of a \(t\)-test:

  • the normality (of the differences): the distribution has slightly long tails
  • the sample size: there are 6574 observations, so the sample size is huge.

The huge sample size means that it does not matter at all that the data distribution is slightly non-normal, because the Central Limit Theorem will give us an enormous amount of help.

If you drew a histogram in the previous part, you probably concluded that it was very much bell-shaped. If so, you need to say here that the normality is just fine, and that it therefore doesn't matter how large the sample is (you need to say both things, to make it clear that you know). Long tails are hard to see on a histogram (it still looks bell-shaped), which is why the normal quantile plot is better when, as here, normality is the thing of interest to you.
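To see why long tails hide on a histogram, here is a quick simulation sketch (base R, not part of the assignment):

```r
# Simulated long-tailed data: a t distribution with 3 df is symmetric
# but has much heavier tails than a normal.
set.seed(1)
x <- rt(1000, df = 3)
mean(abs(x) > 3)   # fraction of values beyond +/-3: much more than normal
pnorm(-3) * 2      # a normal's fraction beyond +/-3, about 0.003
# hist(x) still looks bell-shaped; qqnorm(x); qqline(x) shows the tails
```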

Another way to approach this part is to start with the huge sample size, and say that because the sample size is so large, you don’t need to consider the normality of the differences at all. Again, you need to say both parts, though.

Extra: I didn’t need a bootstrap sampling distribution of the sample mean here; in fact, I tried to word the question to make it clear that you needed to use the normal quantile plot or histogram you just made. But I made one, just to see, and you will be completely unsurprised by how it turns out:

tibble(sim = 1:10000) %>% 
  rowwise() %>% 
  mutate(my_sample = list(sample(wind$difference, replace = TRUE))) %>% 
  mutate(my_mean = mean(my_sample)) %>% 
  ggplot(aes(sample = my_mean)) + stat_qq() + stat_qq_line()

Yeah, that is normal, in case you needed any confirmation that a sample size of 6574 is big enough for the Central Limit Theorem to work.

  1. (3 points) The weather in Ireland tends to blow in off the Atlantic Ocean, so the researchers suspected that the mean wind speed in Valentia would be higher than the mean wind speed in Dublin. Assess the evidence that this is the case. What do you conclude, in the context of the data?

A \(t\)-test, therefore. In the matched pairs case, the alternative is expressed in terms of the column you give first, so you can either do this:

with(wind, t.test(VAL, DUB, paired = TRUE, alternative = "greater"))

    Paired t-test

data:  VAL and DUB
t = 16.445, df = 6573, p-value < 2.2e-16
alternative hypothesis: true mean difference is greater than 0
95 percent confidence interval:
 0.7646229       Inf
sample estimates:
mean difference 
      0.8496136 

or this:

with(wind, t.test(DUB, VAL, paired = TRUE, alternative = "less"))

    Paired t-test

data:  DUB and VAL
t = -16.445, df = 6573, p-value < 2.2e-16
alternative hypothesis: true mean difference is less than 0
95 percent confidence interval:
       -Inf -0.7646229
sample estimates:
mean difference 
     -0.8496136 

or even this:

with(wind, t.test(difference, mu = 0, alternative = "greater"))

    One Sample t-test

data:  difference
t = 16.445, df = 6573, p-value < 2.2e-16
alternative hypothesis: true mean is greater than 0
95 percent confidence interval:
 0.7646229       Inf
sample estimates:
mean of x 
0.8496136 

with the alternative being consistent with the way around you took differences, if you saved them from when you did your graph.

The P-value is the same, basically zero (less than \(2.2 \times 10^{-16}\)), in both cases. The grader will check (i) that you have an alternative, and (ii), by looking at the P-value, that you have it the right way around.

Our null hypothesis here was that Valentia and Dublin have the same mean wind speed, with the alternative being that Valentia’s mean wind speed is greater than Dublin’s. With such a small P-value, we reject the null in favour of the alternative, and hence conclude that Valentia has the larger mean wind speed.
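As a sanity check on the mechanics (a sketch using only numbers copied from the output above), the one-sided interval's lower limit can be reconstructed from the mean difference and the t statistic:

```r
# t = mean difference / SE, so SE = mean difference / t,
# and the one-sided 95% lower limit is mean - t(critical) * SE.
n <- 6574
mean_diff <- 0.8496136
t_stat <- 16.445
se <- mean_diff / t_stat
mean_diff - qt(0.95, df = n - 1) * se   # about 0.7646, as in the output
```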

I think the with way is better (and less work for you) than using the dollar signs, because in that case you have to name your dataframe twice, and also the output is more cluttered:

t.test(wind$VAL, wind$DUB, paired = TRUE, alternative = "greater")

    Paired t-test

data:  wind$VAL and wind$DUB
t = 16.445, df = 6573, p-value < 2.2e-16
alternative hypothesis: true mean difference is greater than 0
95 percent confidence interval:
 0.7646229       Inf
sample estimates:
mean difference 
      0.8496136 

Extra 1: A P-value can never be exactly zero, because it is always possible, even if very unlikely, to observe a result at least as extreme as the one you did when the null hypothesis is correct. The number quoted here for the P-value is what is called “double-precision machine epsilon”: a number which, to the computer, is indistinguishable from zero, in that if you add it to something, you get that something back, not the something plus a little bit:

2 + 2.2e-16
[1] 2

and to confirm that:

.Machine$double.eps
[1] 2.220446e-16

Extra 2: there is a second dataset, which says where the weather stations are:

my_url <- "http://ritsokiguess.site/datafiles/wind_locations.csv"
wind_locations <- read_csv(my_url)
Rows: 12 Columns: 5
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (2): Station, Code
dbl (3): Latitude, Longitude, MeanWind

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
wind_locations

This gives the latitudes and longitudes, and full names, of the weather stations. In D29, we learn how to draw things like these on a map. Here is a preview. Hover over the circles to see the name of the weather station they represent:

leaflet(data = wind_locations) %>% 
  addTiles() %>% 
  addCircleMarkers(label = ~Station)
Assuming "Longitude" and "Latitude" are longitude and latitude, respectively

Valentia is the one at the bottom left. When a storm reaches Ireland, it comes in from the southwest (usually), which means it would hit Valentia first and maybe blow itself out some as it crosses Ireland to Dublin (on the right). Ireland is not very big, so a storm that hits Valentia will hit Dublin (with maybe lesser intensity) on the same day. Hence the correlation in wind speeds we saw in Figure 1.

  1. (3 points) Obtain a 99% confidence interval for the mean difference in wind speeds between Valentia and Dublin. Summarize and interpret your confidence interval for your reader.

Run t.test again, but this time two-sided (take out the alternative) and with the right confidence level:

with(wind, t.test(VAL, DUB, paired = TRUE, conf.level = 0.99))

    Paired t-test

data:  VAL and DUB
t = 16.445, df = 6573, p-value < 2.2e-16
alternative hypothesis: true mean difference is not equal to 0
99 percent confidence interval:
 0.7164989 0.9827284
sample estimates:
mean difference 
      0.8496136 

The mean difference in wind speeds is between 0.72 and 0.98 knots higher in Valentia than in Dublin, with 99% confidence.

This is the logical followup to the test: having found out that the wind speed is higher in Valentia on average, the next question is “how much higher?” This is the answer to that question. The average difference is less than 1 knot (about 1.8 km/h), which isn’t very big, but because we have so much data, the difference is estimated precisely and so it is significantly bigger than zero.1
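If you want the interval in more familiar units, here is a quick conversion sketch, using the exact conversion 1 knot = 1.852 km/h:

```r
# Convert the 99% CI endpoints from knots to km/h
ci_knots <- c(0.7164989, 0.9827284)
round(ci_knots * 1.852, 2)   # roughly 1.33 to 1.82 km/h
```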

Blood coagulation

If you suffer a cut, you will bleed for a time, and then the blood flow will stop. This is called “clotting” or “coagulation”. The blood changes from a liquid to a gel, forming a blood clot, which allows the healing process to begin. This also happens in other mammals.

24 animals were each randomly assigned to one of four different diets. After some time on their diet, animals had blood samples taken, and the blood coagulation time (in seconds) was measured. A lower coagulation time is better (so that in the case of an injury, the healing process can begin sooner). The data are in http://ritsokiguess.site/datafiles/faraway_coagulation.csv. There are two columns, coag that contains the coagulation times, and diet (diets labelled A through D).

I will ask you to run all three of the types of analysis of variance below; do so even if you believe they are the wrong thing to do. You will get a chance to comment later. (This is so you get the practice at running each of them.)

  1. (1 point) Read in and display (some of) the data.

To no-one’s great surprise:

my_url <- 'http://ritsokiguess.site/datafiles/faraway_coagulation.csv'
coagulation <- read_csv(my_url)  
Rows: 24 Columns: 2
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (1): diet
dbl (1): coag

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
coagulation
  1. (2 points) Make a suitable plot of the two variables.

One quantitative variable and one categorical variable point toward a boxplot:

ggplot(coagulation, aes(x = diet, y = coag)) + geom_boxplot()

Alternatively, looking ahead, you might have seen that normality is something we will be concerned about later, and that might have prompted you to draw a normal quantile plot. The key thing about normality in this context is that you want to assess it within each group, so you actually need a facetted normal quantile plot, like this:

ggplot(coagulation, aes(sample = coag)) +
  stat_qq() + stat_qq_line() +
  facet_wrap(~ diet, scales = "free")

If you draw normal quantile plots, you should have a reason for using or not using scales = "free". My take is that including it is better, so that the plots “fill their facets” and you get the best picture about normality (which is the reason you would draw this plot rather than a boxplot). A reason for not using scales = "free" is that you want to compare centre and spread as well:

ggplot(coagulation, aes(sample = coag)) +
  stat_qq() + stat_qq_line() +
  facet_wrap(~ diet)

Diets where the plot is higher up in its facet (B and C) have a higher median, and diets that fill more of their facets (D, say) have a larger spread, which you can also see because of a steeper line. I didn’t need comment here, however.

Extra: you might note that with only 24 observations altogether, there are only about 6 in each group, so that the sample sizes are tiny. With that in mind, there seems to be a disturbing number of outliers. The highest values for diets B and C show up as outliers on both the boxplot and the normal quantile plot. We’ll come back to this later.

  1. (3 points) Run a standard analysis of variance and display the results. What do you conclude, in the context of the data?
coag.1 <- aov(coag ~ diet, data = coagulation)
summary(coag.1)
            Df Sum Sq Mean Sq F value   Pr(>F)    
diet         3    228    76.0   13.57 4.66e-05 ***
Residuals   20    112     5.6                     
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
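Extra: the F statistic in the table is just the ratio of the two mean squares, which you can verify by hand from the sums of squares and degrees of freedom shown:

```r
# Reconstruct the F statistic and P-value from the ANOVA table
ms_diet <- 228 / 3      # mean square for diet
ms_resid <- 112 / 20    # residual mean square
f_stat <- ms_diet / ms_resid
f_stat                                             # 13.571, as in the table
pf(f_stat, df1 = 3, df2 = 20, lower.tail = FALSE)  # about 4.66e-05
```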

The null hypothesis is that the mean coagulation time is the same for all four diets, with the alternative hypothesis being that the means are not all the same. With a P-value of \(4.7 \times 10^{-5}\), which is less than 0.05, we reject the null in favour of the alternative. Hence, the diets differ somehow in terms of mean coagulation time.

However, we are not entitled to say that they are all different, which your answer needs to make clear. Saying “the diets are different” is not clear enough: are they all different, or are some of them different from the others?

Tukey is coming up next.

  1. (2 points) If appropriate, run the appropriate followup to your analysis in the previous question. (If not, explain why not.)
TukeyHSD(coag.1)
  Tukey multiple comparisons of means
    95% family-wise confidence level

Fit: aov(formula = coag ~ diet, data = coagulation)

$diet
    diff         lwr       upr     p adj
B-A    5   0.7245544  9.275446 0.0183283
C-A    7   2.7245544 11.275446 0.0009577
D-A    0  -4.0560438  4.056044 1.0000000
C-B    2  -1.8240748  5.824075 0.4766005
D-B   -5  -8.5770944 -1.422906 0.0044114
D-C   -7 -10.5770944 -3.422906 0.0001268

I am not asking for comment here (that’s coming up later), but it looks as if diet D has the lowest mean and is significantly lower than everything else except A.
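One way to summarize which pairs differ at \(\alpha = 0.05\) is to pick out the small adjusted P-values; this sketch simply copies the values from the Tukey output above by hand rather than extracting them programmatically:

```r
# Adjusted P-values copied from the TukeyHSD output
p_adj <- c("B-A" = 0.0183283, "C-A" = 0.0009577, "D-A" = 1.0000000,
           "C-B" = 0.4766005, "D-B" = 0.0044114, "D-C" = 0.0001268)
names(p_adj)[p_adj < 0.05]   # the significantly different pairs
```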

  1. (3 points) Run a Welch ANOVA on these data, displaying the results. Is your conclusion the same as for the standard analysis of variance?
oneway.test(coag ~ diet, data = coagulation)

    One-way analysis of means (not assuming equal variances)

data:  coag and diet
F = 16.728, num df = 3.0000, denom df = 9.9533, p-value = 0.0003249

The P-value is again less than 0.05, so my conclusion is the same. (This last sentence for a cheap third point.)

  1. (2 points) If appropriate, run the appropriate followup to your analysis in the previous question. (If not, explain why not.)

The right followup this time is Games-Howell. Make sure you have package PMCMRplus installed (which you do outside of the Quarto document you hand in, because otherwise you will unnecessarily re-install it every time you re-render your document), and loaded with library(PMCMRplus). Then:

gamesHowellTest(coag ~ factor(diet), data = coagulation)

    Pairwise comparisons using Games-Howell test
data: coag by factor(diet)
  A      B      C     
B 0.0381 -      -     
C 0.0032 0.4843 -     
D 1.0000 0.0287 0.0003

P value adjustment method: none
alternative hypothesis: two.sided

Make sure you have factor(diet) instead of just diet: this implementation of the Games-Howell test is more picky than Tukey is about the type of thing on the right of the squiggle.

The conclusion is more or less the same as for the Tukey analysis (which you don't need to say).

  1. (3 points) Run Mood’s median test on these data, displaying your results. Is your conclusion consistent with the other tests?

Load smmr,2 and then:

median_test(coagulation, coag, diet)
$grand_median
[1] 63.5

$table
     above
group above below
    A     0     4
    B     5     1
    C     6     0
    D     1     7

$test
       what        value
1 statistic 17.166666667
2        df  3.000000000
3   P-value  0.000653094

Once again, the P-value is less than 0.05, and hence the conclusion is the same. (Another cheap third point.)

Extra: The table of aboves and belows gives a sense of how the diets compare: A and D are almost all below the grand median, and B and C are almost all above.
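That same table also reproduces the test statistic: Mood's median test is a chi-squared test on the aboves and belows. A base-R sketch (R will warn about small expected counts, which is fine for this illustration):

```r
# The table of aboves and belows from the median_test output
tab <- matrix(c(0, 4,
                5, 1,
                6, 0,
                1, 7),
              nrow = 4, byrow = TRUE,
              dimnames = list(diet = c("A", "B", "C", "D"),
                              c("above", "below")))
# chi-squared test with no continuity correction reproduces the statistic
suppressWarnings(chisq.test(tab, correct = FALSE))$statistic  # 17.167
pchisq(17.16667, df = 3, lower.tail = FALSE)                  # about 0.000653
```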

  1. (2 points) If appropriate, run the appropriate followup to your analysis in the previous question. (If not, explain why not.)

This time, pairwise median tests are the right followup (and we should again run them, because the P-value in the previous question was less than 0.05, so there are differences to find):

pairwise_median_test(coagulation, coag, diet)
  1. (2 points) Which of your previous analyses is the most appropriate for these data? Explain briefly.

Go back to your graph, and make an assessment of normality and, if necessary, of equal spread. My guess is that you will see the outliers (on whichever graph you did) and the tiny sample sizes and say “not normal”, and therefore the pairing of Mood’s median test and the pairwise median tests is the most appropriate one.

If you come to the conclusion that normality is good enough, then you need to assess equal spreads. On the boxplot, diet C has a smaller interquartile range, but has two outliers; on the normal quantile plots, the slopes of the lines are not very different, though you can certainly argue that diet C has the least steep line. The quality of the argument is more important than the conclusion you come to.

Another way to approach this is via a bootstrap sampling distribution of the sample mean. What you do in this situation is to see which diet is least normal (given its sample size), which appears to be diet C, and investigate that one. If that one is not normal enough, normality fails for the analysis as a whole:

coagulation %>% 
  filter(diet == "C") -> diet_c
diet_c

and then

tibble(sim = 1:10000) %>% 
  rowwise() %>% 
  mutate(my_sample = list(sample(diet_c$coag, replace = TRUE))) %>% 
  mutate(my_mean = mean(my_sample)) %>% 
  ggplot(aes(sample = my_mean)) + stat_qq() + stat_qq_line()

This is noticeably skewed to the right (the curved pattern), but it is not as bad as you might have expected (the points are fairly close to the line). The horizontal lines are because the coagulation times were only measured to the nearest second (see the display of what I called diet_c) and several of them are identical, so there are relatively few possible bootstrap sample means.

  1. (2 points) Hence, what do you conclude is the best diet or diets for blood coagulation? Use \(\alpha = 0.10\) for this.

Look at the followup test for your preferred analysis, and take the best diet (lowest coagulation time) and all the diets that are not significantly different from it. From the boxplot, the lowest medians are for A and D (it’s difficult to tell which is lower), and these are, at \(\alpha = 0.10\), both significantly better than both of diets B and C (whichever of the tests you thought was best). Hence, either diet A or diet D is the best one. (I threw in the 0.10 thing to make sure they were all consistent, to give you and the grader an easier task.)

It’s important to note that diets A and D are not significantly different, so you have no business favouring one of them over the other one, even if its median or mean is slightly smaller. The results of your test indicate that if you were to run this experiment again, diets B and C would be worst again (most likely), but there is nothing to choose between diets A and D, and the one that was second-best this time might come out best next time.

Footnotes

  1. This is the opposite to the usual situation with modestly-sized samples: in that case, it takes a big difference to be significant, but the confidence interval is typically much wider.↩︎

  2. You should already have this installed from when you ran the sign test.↩︎