Worksheet 8

Published

October 21, 2024

Questions are below. My solutions will be available after the tutorials are all finished. The whole point of these worksheets is for you to use your lecture notes to figure out what to do. In tutorial, the TAs are available to guide you if you get stuck. Once you have figured out how to do this worksheet, you will be prepared to tackle the assignment that depends on it.

If you are not able to finish in an hour, I encourage you to continue later with what you were unable to finish in tutorial.

Hurricanes

The number of hurricanes making landfall on the east coast of the US was recorded each year from 1914 to 2014. The “hurricane season” is from June 1 to November 30 each year. The data are recorded in the file http://ritsokiguess.site/datafiles/hurricanes.csv. There are three columns: the year, the number of hurricanes, and period, in which the years are divided up into 25-year periods.

You might imagine that climate change may have caused the number of hurricanes to change over time. One way to assess this is to divide the years into a number of groups (such as the 25-year periods here), and then to compare the number of hurricanes in each group. If the groups differ, there is some kind of trend over time.

  1. Read in and display (some of) the data.
my_url <- "http://ritsokiguess.site/datafiles/hurricanes.csv"
hurricanes <- read_csv(my_url)
Rows: 101 Columns: 3
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (1): period
dbl (2): Year, Hurricanes

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
hurricanes

The three columns as promised, along with one row per year.1
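
As an optional check (and just one way to do it), you can count how many years ended up in each time period:

hurricanes %>% count(period)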

  2. Make a suitable plot of the number of hurricanes by time period.

Time period is categorical, and the number of hurricanes is quantitative, so a boxplot:

ggplot(hurricanes, aes(x = period, y = Hurricanes)) + geom_boxplot()

Extra: Counts of things often have something like a Poisson distribution. The variance of a Poisson distribution is the same as its mean, which would mean that a time period with more hurricanes (on average) would be expected to have a larger spread of hurricane counts, which seems to be about what has happened here. Counts have a lower limit of 0, so you might expect them to be skewed away from zero (that is, to the right).
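
If you want to check this mean-variance idea numerically, one option (a sketch, using the dataframe and column names above; the summary column names are my own) is to work out the mean and variance of the counts in each time period:

hurricanes %>% 
  group_by(period) %>% 
  summarize(mean_count = mean(Hurricanes), var_count = var(Hurricanes))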

  3. Run a suitable analysis to compare the numbers of hurricanes in the four time periods. Justify any choices you make. (That is to say, I am not telling you what kind of analysis to run; it is up to you to choose what you think is the best thing to do and to be able to say why you made the choice you did. My solutions to this question go into a fair bit of detail about what your considerations might be.)

This will be some kind of analysis of variance (to compare means or medians of four groups).

The first choice is which type of analysis to run: regular ANOVA, Welch ANOVA, or Mood’s median test. To decide, look at your boxplot and ask whether the distributions are normal enough given the sample sizes (25 or so in each group). The Central Limit Theorem will help, so things will not be as bad as they look.
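
If you would like something beyond the boxplot to judge normality, one possibility (a sketch, using the same dataframe) is a normal quantile plot of the hurricane counts in each time period:

ggplot(hurricanes, aes(sample = Hurricanes)) + 
  stat_qq() + stat_qq_line() +
  facet_wrap(~ period)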

In summary, you need to have a reason for the analysis you do: something along one of these lines:

  • The skewness or outliers are still too much of a problem (in at least one of the groups), and therefore you should run Mood’s median test.
  • Given the sample sizes, all the distributions are close enough to normal. Then, either:
    • the spreads are close enough to equal (the best argument for this is that the least tall boxes are also from the distributions with outliers, so these at least somewhat balance out; the boxes themselves are definitely not equal in height), therefore run a regular ANOVA.
    • the spreads are not close to equal (by comparing the heights of the boxes), therefore run a Welch ANOVA.

Two points for making a properly reasoned call about the kind of analysis to run. I have no preference about your final choice, as long as your reason for making that choice is sound.

Then, running the analysis that you chose. One of these:

Mood median test:

median_test(hurricanes, Hurricanes, period)
$grand_median
[1] 5

$table
              above
group          above below
  1914 to 1939     6    16
  1940 to 1964    15     7
  1965 to 1989    10     9
  1990 to 2014    16     8

$test
       what      value
1 statistic 9.67324773
2        df 3.00000000
3   P-value 0.02155796

The P-value of 0.0215 is less than 0.05, so the four time periods do not all have the same median.2

Welch ANOVA:

oneway.test(Hurricanes ~ period, data = hurricanes)

    One-way analysis of means (not assuming equal variances)

data:  Hurricanes and period
F = 3.6079, num df = 3.0, denom df = 53.3, p-value = 0.01905

The P-value here (0.019) is in the same ballpark as the one for Mood’s median test, and again indicates that not all the time periods have the same mean number of hurricanes.

Regular ANOVA:

hurricanes.1 <- aov(Hurricanes ~ period, data = hurricanes)
summary(hurricanes.1)
            Df Sum Sq Mean Sq F value  Pr(>F)   
period       3   87.3  29.088    4.52 0.00519 **
Residuals   97  624.2   6.435                   
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

With a P-value of 0.0052, the time periods are not all the same in terms of mean number of hurricanes.

Two more points for correctly running your chosen analysis and drawing an appropriate conclusion from it. The danger is going too far here: all you can say is that the means/medians are not all equal.

The last two points are for running an appropriate followup given your choice of test:

If you ran Mood’s median test, pairwise median tests:

pairwise_median_test(hurricanes, Hurricanes, period)

Two of the comparisons are significant: the earliest time period 1914-1939 differs from the latest one 1990-2014, and also from the second time period 1940-1964 (but not from the third one). None of the other differences are significant. What seems to have happened is that the 1914-1939 time period had an unusually low number of hurricanes. Remember to look at the adjusted (for multiple comparisons) P-values in the last column; by those, the difference between the third and fourth time periods (in the last row) is not significant.

If you ran a Welch ANOVA, the appropriate followup is Games-Howell:3

gamesHowellTest(Hurricanes ~ factor(period), data = hurricanes)

    Pairwise comparisons using Games-Howell test
data: Hurricanes by factor(period)
             1914 to 1939 1940 to 1964 1965 to 1989
1940 to 1964 0.094        -            -           
1965 to 1989 0.523        0.699        -           
1990 to 2014 0.017        0.567        0.166       

P value adjustment method: none
alternative hypothesis: two.sided

There is only one significant difference now, between the most recent and the oldest time periods. (The first and second time periods are no longer quite significantly different.)

If you ran regular ANOVA, Tukey’s method:

TukeyHSD(hurricanes.1)
  Tukey multiple comparisons of means
    95% family-wise confidence level

Fit: aov(formula = Hurricanes ~ period, data = hurricanes)

$period
                                diff        lwr      upr     p adj
1940 to 1964-1914 to 1939  1.5384615 -0.3190830 3.396006 0.1404182
1965 to 1989-1914 to 1939  0.8984615 -0.9590830 2.756006 0.5876760
1990 to 2014-1914 to 1939  2.5384615  0.6809170 4.396006 0.0030575
1965 to 1989-1940 to 1964 -0.6400000 -2.5156674 1.235667 0.8090281
1990 to 2014-1940 to 1964  1.0000000 -0.8756674 2.875667 0.5063334
1990 to 2014-1965 to 1989  1.6400000 -0.2356674 3.515667 0.1084883

The only significant difference is between the first time period and the last one. None of the other differences are large enough to be significant. (This is the same conclusion as Games-Howell.)

If it so happens that your Mood median test/Welch/regular ANOVA was not significant, then you “do” the followup by explaining why you don’t need to do it (and thus get a rather cheap two points for the third thing).

So, three steps: a reasoned choice of which analysis to do, that analysis, and finally the appropriate followup for that analysis, with conclusions.

Extra 1: another option is to transform the data first in the hopes of getting distributions that are closer to normal. If you go this way, you need to say something about what inspired you to try it (since I haven’t talked about transformations yet), for example that in STAB27 (or your second course), you learned that a transformation of the response can be used to make the data within groups more nearly normal with more nearly equal spreads.4

A transformation that is often useful for counts is the square root:

ggplot(hurricanes, aes(x = period, y = sqrt(Hurricanes))) + geom_boxplot()

Normality is improved only a little, but maybe the spreads are closer to equal once you account for the outliers. If you are prepared to argue that normality is now good enough, you can run regular ANOVA:

hurricanes.2 <- aov(sqrt(Hurricanes) ~ period, data = hurricanes)
summary(hurricanes.2)
            Df Sum Sq Mean Sq F value  Pr(>F)   
period       3  4.432   1.477   4.828 0.00356 **
Residuals   97 29.683   0.306                   
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

followed by Tukey:

TukeyHSD(hurricanes.2)
  Tukey multiple comparisons of means
    95% family-wise confidence level

Fit: aov(formula = sqrt(Hurricanes) ~ period, data = hurricanes)

$period
                                diff          lwr       upr     p adj
1940 to 1964-1914 to 1939  0.4083197  0.003257533 0.8133818 0.0474176
1965 to 1989-1914 to 1939  0.2685875 -0.136474624 0.6736497 0.3121948
1990 to 2014-1914 to 1939  0.5668695  0.161807375 0.9719317 0.0022990
1965 to 1989-1940 to 1964 -0.1397322 -0.548746219 0.2692819 0.8084570
1990 to 2014-1940 to 1964  0.1585498 -0.250464221 0.5675639 0.7420358
1990 to 2014-1965 to 1989  0.2982820 -0.110732064 0.7072961 0.2321512

or Welch ANOVA, followed by Games-Howell:

oneway.test(sqrt(Hurricanes) ~ period, data = hurricanes)

    One-way analysis of means (not assuming equal variances)

data:  sqrt(Hurricanes) and period
F = 3.6253, num df = 3.000, denom df = 53.137, p-value = 0.01869
gamesHowellTest(sqrt(Hurricanes) ~ factor(period), data = hurricanes)

    Pairwise comparisons using Games-Howell test
data: sqrt(Hurricanes) by factor(period)
             1914 to 1939 1940 to 1964 1965 to 1989
1940 to 1964 0.058        -            -           
1965 to 1989 0.340        0.661        -           
1990 to 2014 0.016        0.722        0.225       

P value adjustment method: none
alternative hypothesis: two.sided

In these cases, the first and last time periods differ significantly in mean numbers of hurricanes, and the first and second time periods are teetering on the brink when it comes to significance.

You might note that the P-values coming out of the analyses I did are all in the same ballpark, and the conclusions we draw are pretty much the same no matter which way I did it. This suggests that it actually didn’t matter much which analysis we did (see the next Extra for some more on this).

Extra 2: as always, when thinking about “normal enough”, you can do bootstrapped sampling distributions of the sample mean(s). The easiest approach is to pick the sample that you think is worst, and do that one. If it’s not good enough, the “sufficient normality” fails and Mood’s median test was the right one after all.5

Outliers are usually more trouble than skewness, so let’s try the 1914-1939 time period. I’m going to draw a normal quantile plot of the results, so I’ll do 10,000 simulations:

hurricanes %>% filter(period == "1914 to 1939") -> early
early
tibble(sim = 1:10000) %>% 
  rowwise() %>% 
  mutate(my_sample = list(sample(early$Hurricanes, replace = TRUE))) %>% 
  mutate(my_mean = mean(my_sample)) %>% 
  ggplot(aes(sample = my_mean)) + stat_qq() + stat_qq_line()

That is actually pretty good, just a tiny bit right-skewed, but I would be perfectly happy with this. The message from here is that normality is fine (even in what I think is the worst case), so the Welch ANOVA is also fine.6

You might have disagreed with me about the most non-normal group, so I should probably do all four of them, which I think I can do in one shot:

hurricanes %>% nest_by(period) %>% 
  rowwise() %>% 
  mutate(sim = list(1:10000)) %>% 
  unnest(sim) %>% 
  rowwise() %>% 
  mutate(my_sample = list(sample(data$Hurricanes, replace = TRUE))) %>%
  mutate(my_mean = mean(my_sample)) %>% 
  ggplot(aes(sample = my_mean)) + stat_qq() + stat_qq_line() +
    facet_wrap(~ period, scales = "free")

It’s really hard to find much in the way of problems with any of those,7 so that the sample sizes of about 25 really are big enough to overcome the considerable non-normality of the data within each group. This, perhaps, ought to surprise you, but it goes to show just how powerful the Central Limit Theorem is.

The other worst one looks like 1965-1989, which is the other one with an outlier.8

Hence, the “right” answer is to use Welch ANOVA, because the four groups do differ in spread. But a well-argued case for one of the other tests also earns full credit.

About my code above:

hurricanes %>% nest_by(period)

What this does is to make a list-column containing dataframes with all the rows in hurricanes from the time period shown (thus, about 25 in each case), and all the columns from hurricanes except for the one I nested by (that is, only Year and Hurricanes). The advantage is that the things in data are exactly what I want to sample from in a minute.

hurricanes %>% nest_by(period) %>% 
  rowwise() %>% 
  mutate(sim = list(1:3)) %>% 
  unnest(sim) 

This makes a set of simulations for each time period. I showed you just three, instead of 10,000, so you could see how it worked.

Unnesting undoes rowwise, so now we have another rowwise followed by some code that looks very much like what we are used to. The differences are:

  • each bootstrap sample needs to be of hurricane counts for just that time period, which is the column Hurricanes in the cut-down dataframe data each time.
  • we need a separate normal quantile plot for each period, so we glue a facet_wrap onto the end.

Bread

What makes bread rise? Specifically what are the effects of baking temperature and the amount of yeast on how much a loaf of bread will rise while baking? To find out, a batch of a certain bread mix was divided into 48 parts. Each part had a randomly chosen amount of yeast added (0.75, 1, or 1.25 teaspoons) and was then baked at a temperature of either 350 or 425 (degrees Fahrenheit). After baking, the height of each (very small) loaf of bread was measured (in inches). Apart from the yeast and the baking temperature, the ingredients for each small loaf were identical, so any differences in height can be attributed to one or both of the amount of yeast used and the baking temperature.

The data are in http://ritsokiguess.site/datafiles/bread_wide.csv.

  1. Read in and display (most of) the data.

As usual:

my_url <- "http://ritsokiguess.site/datafiles/bread_wide.csv"
bread <- read_csv(my_url)
Rows: 8 Columns: 7
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
dbl (7): row, yeast0.75_temp350, yeast0.75_temp425, yeast1_temp350, yeast1_t...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
bread

There are only 8 rows, so you should see all of them, but the column names are rather long, so you may not see all of the columns.
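
If you want to see all of the (long) column names at once, something like glimpse is one option (not required):

glimpse(bread)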

  2. The values in the dataframe are all values of height under various conditions (of yeast and temperature). Rearrange the dataframe so that you have one column of heights and one column of conditions.

This is your standard pivot_longer. Inputs are: the columns to rearrange (everything but row, or starts_with("yeast") or similar if you prefer). I gave you some clues about what you might call the new columns, but ultimately it is your choice:

bread %>% 
  pivot_longer(-row, names_to = "condition", values_to = "height")

Now all the heights are in one column, labelled (next to them) by the treatment combination that produced that height. If you like, go through the original data and the rearranged data and convince yourself that the right things are still matched up.

It is sufficient to display the dataframe you have so far. We are going to do more with it and save it later.

  3. The column containing the treatment conditions that you created in the previous question contains two things, amount of yeast and temperature. Create two new columns, each containing one of these.

This calls for some kind of separate, in particular separate_wider_delim because we want to make a dataframe that is wider than the one we have now, and the current column condition contains the two things we want separated (“delimited”) by something, namely an underscore. Join this onto the end of what you did previously:

bread %>% pivot_longer(-row, names_to = "condition", values_to = "height") %>% 
  separate_wider_delim(condition, "_", names = c("yeast", "temperature"))

Now we have something we can use. You could save this, or you could use the idea in the next question.

  4. Starting from the data you read in from the file, rearrange it so that there is one column of heights, and columns showing the amount of yeast and the temperature that goes with each height. Do your rearrangement with one command. Save your resulting dataframe.

This is one of the variations of pivot_longer, in particular the one with two names_to, because you want a column of yeast amounts and a column of temperatures. The two parts of the (current) column names are separated by an underscore, so:

bread %>% pivot_longer(-row, names_to = c("yeast", "temperature"),
                    names_sep = "_",
                    values_to = "height") -> bread_long
bread_long

Just a (fancy) pivot-longer, nothing else. Compare the code here to your previous pivot_longer: the two things in names_to, and the names_sep to say what they are separated by. This runs an ordinary pivot_longer and the separate_wider_delim together, all in one shot.
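
As a quick optional check, bread_long should have one row per small loaf, that is, 8 rows times 6 treatment combinations = 48 rows:

nrow(bread_long)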

  5. Make a suitable graph of the three columns (not including row) in your final dataframe.

These columns are yeast, temperature, and height (as I called them; you can use your own names): two categorical and one quantitative, so a grouped boxplot is called for. To make that, choose one of your categorical variables to be x and the other to be fill (or colour); the quantitative variable is y:

ggplot(bread_long, aes(x = yeast, y = height, fill = temperature)) + geom_boxplot()

There are three different values of yeast and only two of temperature, so I put yeast on the \(x\)-axis. It seems better to me to make the best use of the “real-estate” on the \(x\)-axis: there is lots of room for categories there, but having more than a few colours is difficult to sort out.

Having said that, for this question (with only three and two categories), I have no objection if you have three colours:

ggplot(bread_long, aes(fill = yeast, y = height, x = temperature)) + geom_boxplot()

This graph was not so hard to draw once you had the data laid out properly. (This is often the case: most of the work is tidying the data.) I don’t think you could draw the graph at all with the data laid out as it originally was.

Extra: assuming that a bigger height is better, it is best to use more yeast and a lower temperature. The layout of the plot suggests that a lower temperature is better for any amount of yeast, and more yeast is better at either temperature. That is to say, there is no evidence of an interaction between temperature and yeast (if you think of this as an example of a two-way ANOVA). Further evidence of this is that the three pairs of boxplots (or the two sets of three boxplots, depending on which way you drew it) follow more or less parallel patterns, indicating that the effect of one variable does not depend on the level of the other one.
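
If you want to see this lack of interaction more directly, one possibility (a sketch, using bread_long from above) is an interaction plot: work out the mean height for each yeast-temperature combination and join the means with lines, one line per temperature. Lines that are roughly parallel indicate little or no interaction:

ggplot(bread_long, aes(x = yeast, y = height, colour = temperature, group = temperature)) +
  stat_summary(fun = mean, geom = "point") +
  stat_summary(fun = mean, geom = "line")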

Footnotes

  1. If you stop to think about it, you’ll realize that there are 101 years in this timespan, rather than 100.↩︎

  2. The code looks odd because of my dataframe being called hurricanes and the column of hurricane counts being called Hurricanes, but this is right and it works.↩︎

  3. Remember that for this procedure, the explanatory variable must be a genuine factor rather than a categorical variable that is text, which means you usually have to make it one.↩︎

  4. Or whatever it is you learned there.↩︎

  5. This is the same idea as for a two-sample \(t\)-test; you need both or all of the groups to be normal enough, so as soon as one fails, the whole thing fails.↩︎

  6. You might be logically concerned about only testing one group, but the logic is: if the worst group fails, then at least one group fails, and sufficient normality fails. If the worst group looks normal enough, then all the groups must look normal enough (because I picked the worst one), and therefore normality is OK.↩︎

  7. I think the worst one is actually 1965 to 1989, where the skewed-right shape is the clearest, but none of them are bad.↩︎

  8. Which retroactively confirms what I said before about outliers being worse than skewness.↩︎