STAC32 Assignment 7

You are expected to complete this assignment on your own: that is, you may discuss general ideas with others, but the writeup of the work must be entirely your own. If your assignment is unreasonably similar to that of another student, you can expect to be asked to explain yourself.

If you run into problems on this assignment, it is up to you to figure out what to do. The only exception is if it is impossible for you to complete this assignment, for example a data file cannot be read. (There is a difference between you not knowing how to do something, which you have to figure out, and something being impossible, which you are allowed to contact me about.)

You must hand in a rendered document that shows your code, the output that the code produces, and your answers to the questions. This should be a file with .html on the end of its name. There is no credit for handing in your unrendered document (ending in .qmd), because the grader cannot then see whether the code in it runs properly. After you have handed in your file, you should be able to see (in Attempts) what file you handed in, and you should make a habit of checking that you did indeed hand in what you intended to, and that it displays as you expect.

Hint: render your document frequently, and solve any problems as they come up, rather than trying to do so at the end (when you may be close to the due date). If your document will not successfully render, it is because of an error in your code that you will have to find and fix. The error message will tell you where the problem is, but it is up to you to sort out what the problem is.

1 Hurricanes

The number of hurricanes making landfall on the east coast of the US was recorded each year from 1904 to 2014. The “hurricane season” is from June 1 to November 30 each year. The data are recorded in the file http://ritsokiguess.site/datafiles/hurricanes.csv. There are three columns: the year, the number of hurricanes, and period, in which the years are divided up into 25-year periods.

You might imagine that climate change may cause the number of hurricanes to have changed over time. One way to assess this is to divide the years into a number of groups (such as the 25-year periods here), and then to compare the number of hurricanes in each group. If the groups differ, there is some kind of trend over time.

(a) (1 point) Read in and display (some of) the data.

I am using these packages, so you might like to have them loaded and ready to go:

library(tidyverse)
library(smmr)
library(PMCMRplus)
my_url <- "http://ritsokiguess.site/datafiles/hurricanes.csv"
hurricanes <- read_csv(my_url)
Rows: 101 Columns: 3
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (1): period
dbl (2): Year, Hurricanes

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
hurricanes

The three columns as promised, along with one row per year.1

(b) (2 points) Make a suitable plot of the number of hurricanes by time period.

Time period is categorical, and the number of hurricanes is quantitative, so a boxplot:

ggplot(hurricanes, aes(x = period, y = Hurricanes)) + geom_boxplot()

Extra: Counts of things often have something like a Poisson distribution. The variance of a Poisson distribution is the same as its mean, which would mean that a time period with more hurricanes (on average) would be expected to have a larger spread of hurricane counts, which seems to be what about has happened here. Counts have a lower limit of 0, so you might expect them to be skewed away from zero (that is, to the right).

(c) (6 points) Run a suitable analysis to compare the numbers of hurricanes in the four time periods. Justify any choices you make.

This will be some kind of analysis of variance (to compare means or medians of four groups).

The first choice is which type of analysis to run: regular ANOVA, Welch ANOVA, Mood’s median test. To decide that, take a look at your boxplot to decide whether the distributions are normal enough given the sample sizes (25 or so). The Central Limit Theorem will help, so things will not be as bad as they look.

In summary, you need to have a reason for the analysis you do: something along one of these lines:

  • The skewness or outliers are still too much of a problem (in at least one of the groups), and that therefore you should run Mood’s median test.
  • Given the sample sizes, all the distributions are close enough to normal. Then, either:
    • the spreads are close enough to equal (the best argument for this is that the least tall boxes are also from the distributions with outliers, so these at least somewhat balance out; the boxes themselves are definitely not equal in height), therefore run a regular ANOVA.
    • the spreads are not close to equal (by comparing the heights of the boxes), therefore run a Welch ANOVA.

Two points for making a properly reasoned call about the kind of analysis to run. I have no preference about your final choice, as long as your reason for making that choice is sound.

Then, running the analysis that you chose. One of these:

Mood median test:

median_test(hurricanes, Hurricanes, period)
$grand_median
[1] 5

$table
              above
group          above below
  1914 to 1939     6    16
  1940 to 1964    15     7
  1965 to 1989    10     9
  1990 to 2014    16     8

$test
       what      value
1 statistic 9.67324773
2        df 3.00000000
3   P-value 0.02155796

The P-value of 0.0215 is less than 0.05, so the four time periods do not all have the same median.2

Welch ANOVA:

oneway.test(Hurricanes ~ period, data = hurricanes)

    One-way analysis of means (not assuming equal variances)

data:  Hurricanes and period
F = 3.6079, num df = 3.0, denom df = 53.3, p-value = 0.01905

The P-value is in the same ballpark as the one for Mood’s median test (0.019), and again indicates that not all the time periods have the same mean number of hurricanes. Regular ANOVA:

hurricanes.1 <- aov(Hurricanes ~ period, data = hurricanes)
summary(hurricanes.1)
            Df Sum Sq Mean Sq F value  Pr(>F)   
period       3   87.3  29.088    4.52 0.00519 **
Residuals   97  624.2   6.435                   
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

With a P-value of 0.0052, the time periods are not all the same in terms of mean number of hurricanes.

Two more points for correctly running your chosen analysis and drawing an appropriate conclusion from it. The danger with this is to go too far here; all you can say is that the means/medians are not all equal.

The last two points are for running an appropriate followup given your choice of test:

If you ran Mood’s median test, pairwise median tests:

pairwise_median_test(hurricanes, Hurricanes, period)

Two of the comparisons are significant: the earliest time period 1914-1939 with the latest one 1990-2014, and with the second time period 1940-1964 (but not with the third one). None of the other differences are significant. What seems to have happened is that the 1914-1939 time period had an unusually low number of hurricanes. Remember to look at the adjusted (for multiple comparisons) P-values in the last column, so that the differences between the third and fourth time periods (in the last row) is not significant.

If you ran a Welch ANOVA, the appropriate followup is Games-Howell:3

gamesHowellTest(Hurricanes ~ factor(period), data = hurricanes)

    Pairwise comparisons using Games-Howell test
data: Hurricanes by factor(period)
             1914 to 1939 1940 to 1964 1965 to 1989
1940 to 1964 0.094        -            -           
1965 to 1989 0.523        0.699        -           
1990 to 2014 0.017        0.567        0.166       

P value adjustment method: none
alternative hypothesis: two.sided

There is only one significant difference now, between the most recent and the oldest time periods. (The first and second time periods are no longer quite significantly different.)

If you ran regular ANOVA, Tukey’s method:

TukeyHSD(hurricanes.1)
  Tukey multiple comparisons of means
    95% family-wise confidence level

Fit: aov(formula = Hurricanes ~ period, data = hurricanes)

$period
                                diff        lwr      upr     p adj
1940 to 1964-1914 to 1939  1.5384615 -0.3190830 3.396006 0.1404182
1965 to 1989-1914 to 1939  0.8984615 -0.9590830 2.756006 0.5876760
1990 to 2014-1914 to 1939  2.5384615  0.6809170 4.396006 0.0030575
1965 to 1989-1940 to 1964 -0.6400000 -2.5156674 1.235667 0.8090281
1990 to 2014-1940 to 1964  1.0000000 -0.8756674 2.875667 0.5063334
1990 to 2014-1965 to 1989  1.6400000 -0.2356674 3.515667 0.1084883

The only significant difference is between the first time period and the last one. None of the other differences are large enough to be significant. (This is the same conclusion as Games-Howell.)

If it so happens that your Mood median test/Welch/regular ANOVA was not significant, then you “do” the followup by explaining why you don’t need to do it (and thus get a rather cheap two points for the third thing).

So, three steps: a reasoned choice of which analysis to do, that analysis, and finally the appropriate followup for that analysis, with conclusions.

Extra 1: another option is to transform the data first in the hopes of getting distributions that are closer to normal. If you go this way, you need to say something about what inspired you to try it (since I haven’t talked about transformations yet), for example that in B27, you learned that a transformation of the response can be used to make the data within groups more nearly normal with more nearly equal spreads.4

A transformation that is often useful for counts is the square root:

ggplot(hurricanes, aes(x = period, y = sqrt(Hurricanes))) + geom_boxplot()