STAC32 Assignment 7

You are expected to complete this assignment on your own: that is, you may discuss general ideas with others, but the writeup of the work must be entirely your own. If your assignment is unreasonably similar to that of another student, you can expect to be asked to explain yourself.

If you run into problems on this assignment, it is up to you to figure out what to do. The only exception is if it is impossible for you to complete this assignment, for example a data file cannot be read. (There is a difference between you not knowing how to do something, which you have to figure out, and something being impossible, which you are allowed to contact me about.)

You must hand in a rendered document that shows your code, the output that the code produces, and your answers to the questions. This should be a file with .html on the end of its name. There is no credit for handing in your unrendered document (ending in .qmd), because the grader cannot then see whether the code in it runs properly. After you have handed in your file, you should be able to see (in Attempts) what file you handed in, and you should make a habit of checking that you did indeed hand in what you intended to, and that it displays as you expect.

Hint: render your document frequently, and solve any problems as they come up, rather than trying to do so at the end (when you may be close to the due date). If your document will not successfully render, it is because of an error in your code that you will have to find and fix. The error message will tell you where the problem is, but it is up to you to sort out what the problem is.

1 Hurricanes

The number of hurricanes making landfall on the east coast of the US was recorded each year from 1914 to 2014. The “hurricane season” is from June 1 to November 30 each year. The data are recorded in the file http://ritsokiguess.site/datafiles/hurricanes.csv. There are three columns: the year, the number of hurricanes, and period, which divides the years into 25-year periods.

You might imagine that climate change may cause the number of hurricanes to have changed over time. One way to assess this is to divide the years into a number of groups (such as the 25-year periods here), and then to compare the number of hurricanes in each group. If the groups differ, there is some kind of trend over time.
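
(An aside: the period column has already been made for you here, but if you ever had to build such a grouping yourself, base R’s cut() is one way to do it. A minimal sketch, where the break points are my guesses based on the period labels in the data:

cut(c(1920, 1950, 1975, 2000),
    breaks = c(1913, 1939, 1964, 1989, 2014),
    labels = c("1914 to 1939", "1940 to 1964", "1965 to 1989", "1990 to 2014"))

Each year lands in the interval whose label describes it; the default intervals include their right endpoint, which is why the first break is 1913.)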

(a) (1 point) Read in and display (some of) the data.

I am using these packages, so you might like to have them loaded and ready to go:

library(tidyverse)
library(smmr)
library(PMCMRplus)
my_url <- "http://ritsokiguess.site/datafiles/hurricanes.csv"
hurricanes <- read_csv(my_url)
Rows: 101 Columns: 3
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (1): period
dbl (2): Year, Hurricanes

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
hurricanes

The three columns as promised, along with one row per year.1
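
If you want to check that the periods really do contain about 25 years each, a quick count (using our dataframe and column names) will confirm it:

hurricanes %>% count(period)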

(b) (2 points) Make a suitable plot of the number of hurricanes by time period.

Time period is categorical, and the number of hurricanes is quantitative, so a boxplot:

ggplot(hurricanes, aes(x = period, y = Hurricanes)) + geom_boxplot()

Extra: Counts of things often have something like a Poisson distribution. The variance of a Poisson distribution is the same as its mean, which would mean that a time period with more hurricanes (on average) would be expected to have a larger spread of hurricane counts, which seems to be about what has happened here. Counts have a lower limit of 0, so you might expect them to be skewed away from zero (that is, to the right).
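
If you want to put numbers on the Poisson story, compare each period’s mean count with its variance (a quick check using our column names):

hurricanes %>% 
  group_by(period) %>% 
  summarize(mean_count = mean(Hurricanes), var_count = var(Hurricanes))

If the counts really are close to Poisson, the two columns should come out roughly equal within each period.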

(c) (6 points) Run a suitable analysis to compare the numbers of hurricanes in the four time periods. Justify any choices you make.

This will be some kind of analysis of variance (to compare means or medians of four groups).

The first choice is which type of analysis to run: regular ANOVA, Welch ANOVA, Mood’s median test. To decide that, take a look at your boxplot to decide whether the distributions are normal enough given the sample sizes (25 or so). The Central Limit Theorem will help, so things will not be as bad as they look.

In summary, you need to have a reason for the analysis you do: something along one of these lines:

  • The skewness or outliers are still too much of a problem (in at least one of the groups), and therefore you should run Mood’s median test.
  • Given the sample sizes, all the distributions are close enough to normal. Then, either:
    • the spreads are close enough to equal (the best argument for this is that the shortest boxes also come from the distributions with outliers, so these at least somewhat balance out; the boxes themselves are definitely not equal in height), therefore run a regular ANOVA.
    • the spreads are not close to equal (by comparing the heights of the boxes), therefore run a Welch ANOVA.

Two points for making a properly reasoned call about the kind of analysis to run. I have no preference about your final choice, as long as your reason for making that choice is sound.
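
Whichever way you argue about the spreads, a numerical summary can back up what you see in the boxplot. A sketch: one rough guide sometimes used is that a largest-to-smallest SD ratio under about 2 is close enough to equal:

hurricanes %>% 
  group_by(period) %>% 
  summarize(sd_count = sd(Hurricanes)) %>% 
  summarize(ratio = max(sd_count) / min(sd_count))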

Then, running the analysis that you chose. One of these:

Mood median test:

median_test(hurricanes, Hurricanes, period)
$grand_median
[1] 5

$table
              above
group          above below
  1914 to 1939     6    16
  1940 to 1964    15     7
  1965 to 1989    10     9
  1990 to 2014    16     8

$test
       what      value
1 statistic 9.67324773
2        df 3.00000000
3   P-value 0.02155796

The P-value of 0.0215 is less than 0.05, so the four time periods do not all have the same median.2
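
The grand median shown in the output is just the median of all the hurricane counts, which you can verify directly if you like:

median(hurricanes$Hurricanes)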

Welch ANOVA:

oneway.test(Hurricanes ~ period, data = hurricanes)

    One-way analysis of means (not assuming equal variances)

data:  Hurricanes and period
F = 3.6079, num df = 3.0, denom df = 53.3, p-value = 0.01905

The P-value is in the same ballpark as the one for Mood’s median test (0.019), and again indicates that not all the time periods have the same mean number of hurricanes.

Regular ANOVA:

hurricanes.1 <- aov(Hurricanes ~ period, data = hurricanes)
summary(hurricanes.1)
            Df Sum Sq Mean Sq F value  Pr(>F)   
period       3   87.3  29.088    4.52 0.00519 **
Residuals   97  624.2   6.435                   
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

With a P-value of 0.0052, the time periods are not all the same in terms of mean number of hurricanes.

Two more points for correctly running your chosen analysis and drawing an appropriate conclusion from it. The danger here is going too far; all you can say is that the means/medians are not all equal.

The last two points are for running an appropriate followup given your choice of test:

If you ran Mood’s median test, pairwise median tests:

pairwise_median_test(hurricanes, Hurricanes, period)

Two of the comparisons are significant: the earliest time period 1914-1939 differs from the latest one 1990-2014, and from the second time period 1940-1964 (but not from the third one). None of the other differences are significant. What seems to have happened is that the 1914-1939 time period had an unusually low number of hurricanes. Remember to look at the P-values in the last column, adjusted for multiple comparisons: by those, the difference between the third and fourth time periods (in the last row) is not significant.

If you ran a Welch ANOVA, the appropriate followup is Games-Howell:3

gamesHowellTest(Hurricanes ~ factor(period), data = hurricanes)

    Pairwise comparisons using Games-Howell test
data: Hurricanes by factor(period)
             1914 to 1939 1940 to 1964 1965 to 1989
1940 to 1964 0.094        -            -           
1965 to 1989 0.523        0.699        -           
1990 to 2014 0.017        0.567        0.166       

P value adjustment method: none
alternative hypothesis: two.sided

There is only one significant difference now, between the most recent and the oldest time periods. (The first and second time periods are no longer quite significantly different.)

If you ran regular ANOVA, Tukey’s method:

TukeyHSD(hurricanes.1)
  Tukey multiple comparisons of means
    95% family-wise confidence level

Fit: aov(formula = Hurricanes ~ period, data = hurricanes)

$period
                                diff        lwr      upr     p adj
1940 to 1964-1914 to 1939  1.5384615 -0.3190830 3.396006 0.1404182
1965 to 1989-1914 to 1939  0.8984615 -0.9590830 2.756006 0.5876760
1990 to 2014-1914 to 1939  2.5384615  0.6809170 4.396006 0.0030575
1965 to 1989-1940 to 1964 -0.6400000 -2.5156674 1.235667 0.8090281
1990 to 2014-1940 to 1964  1.0000000 -0.8756674 2.875667 0.5063334
1990 to 2014-1965 to 1989  1.6400000 -0.2356674 3.515667 0.1084883

The only significant difference is between the first time period and the last one. None of the other differences are large enough to be significant. (This is the same conclusion as Games-Howell.)

If it so happens that your Mood median test/Welch/regular ANOVA was not significant, then you “do” the followup by explaining why you don’t need to do it (and thus get a rather cheap two points for the third thing).

So, three steps: a reasoned choice of which analysis to do, that analysis, and finally the appropriate followup for that analysis, with conclusions.

Extra 1: another option is to transform the data first in the hopes of getting distributions that are closer to normal. If you go this way, you need to say something about what inspired you to try it (since I haven’t talked about transformations yet), for example that in B27, you learned that a transformation of the response can be used to make the data within groups more nearly normal with more nearly equal spreads.4

A transformation that is often useful for counts is the square root:

ggplot(hurricanes, aes(x = period, y = sqrt(Hurricanes))) + geom_boxplot()

Normality is improved only a little, but maybe the spreads are closer to equal once you account for the outliers. If you are prepared to argue that normality is now good enough, you can run regular ANOVA:

hurricanes.2 <- aov(sqrt(Hurricanes) ~ period, data = hurricanes)
summary(hurricanes.2)
            Df Sum Sq Mean Sq F value  Pr(>F)   
period       3  4.432   1.477   4.828 0.00356 **
Residuals   97 29.683   0.306                   
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

followed by Tukey:

TukeyHSD(hurricanes.2)
  Tukey multiple comparisons of means
    95% family-wise confidence level

Fit: aov(formula = sqrt(Hurricanes) ~ period, data = hurricanes)

$period
                                diff          lwr       upr     p adj
1940 to 1964-1914 to 1939  0.4083197  0.003257533 0.8133818 0.0474176
1965 to 1989-1914 to 1939  0.2685875 -0.136474624 0.6736497 0.3121948
1990 to 2014-1914 to 1939  0.5668695  0.161807375 0.9719317 0.0022990
1965 to 1989-1940 to 1964 -0.1397322 -0.548746219 0.2692819 0.8084570
1990 to 2014-1940 to 1964  0.1585498 -0.250464221 0.5675639 0.7420358
1990 to 2014-1965 to 1989  0.2982820 -0.110732064 0.7072961 0.2321512

or Welch ANOVA, followed by Games-Howell:

oneway.test(sqrt(Hurricanes) ~ period, data = hurricanes)

    One-way analysis of means (not assuming equal variances)

data:  sqrt(Hurricanes) and period
F = 3.6253, num df = 3.000, denom df = 53.137, p-value = 0.01869
gamesHowellTest(sqrt(Hurricanes) ~ factor(period), data = hurricanes)

    Pairwise comparisons using Games-Howell test
data: sqrt(Hurricanes) by factor(period)
             1914 to 1939 1940 to 1964 1965 to 1989
1940 to 1964 0.058        -            -           
1965 to 1989 0.340        0.661        -           
1990 to 2014 0.016        0.722        0.225       

P value adjustment method: none
alternative hypothesis: two.sided

In these cases, the first and last time periods differ significantly in mean numbers of hurricanes, and the first and second time periods are teetering on the brink when it comes to significance.

You might note that the P-values coming out of the analyses I did are all in the same ballpark, and the conclusions we draw are pretty much the same no matter which way I did it. This suggests that it actually didn’t matter much which analysis we did (see the next Extra for some more on this).

Extra 2: as always, when thinking about “normal enough”, you can do bootstrapped sampling distributions of the sample mean(s). The easiest approach is to pick the sample that you think is worst, and do that one. If it’s not good enough, the “sufficient normality” fails and Mood’s median test was the right one after all.5

Outliers are usually more trouble than skewness, so let’s try the 1914-1939 time period. I’m going to draw a normal quantile plot of the results, so I’ll do 10,000 simulations:

hurricanes %>% filter(period == "1914 to 1939") -> early
early
tibble(sim = 1:10000) %>% 
  rowwise() %>% 
  mutate(my_sample = list(sample(early$Hurricanes, replace = TRUE))) %>% 
  mutate(my_mean = mean(my_sample)) %>% 
  ggplot(aes(sample = my_mean)) + stat_qq() + stat_qq_line()

That is actually pretty good, just a tiny bit right-skewed, but I would be perfectly happy with this. The message from here is that normality is fine (even in what I think is the worst case), so the Welch ANOVA is also fine.6

You might have disagreed with me about the most non-normal group, so I should probably do all four of them, which I think I can do in one shot:

hurricanes %>% nest_by(period) %>% 
  rowwise() %>% 
  mutate(sim = list(1:10000)) %>% 
  unnest(sim) %>% 
  rowwise() %>% 
  mutate(my_sample = list(sample(data$Hurricanes, replace = TRUE))) %>%
  mutate(my_mean = mean(my_sample)) %>% 
  ggplot(aes(sample = my_mean)) + stat_qq() + stat_qq_line() +
    facet_wrap(~ period, scales = "free")

It’s really hard to find any problems with any of those, so the sample sizes of about 25 really are big enough to overcome the considerable non-normality of the data within each group. This, perhaps, ought to surprise you, but it goes to show just how powerful the Central Limit Theorem is.

I think I did pick the worst one to assess before. The second-worst looks like 1965-1989, which is the other one with an outlier.7

Hence, the “right” answer is to use Welch ANOVA because the four groups do differ in spread. But a well-argued case for one of the other tests is also full credit.

About my code above:

hurricanes %>% nest_by(period)

What this does is to make a list-column containing dataframes with all the rows in hurricanes from the time period shown (thus, about 25 in each case), and all the columns from hurricanes except for the one I nested by (that is, only Year and Hurricanes). The advantage is that the things in data are exactly what I want to sample from in a minute.
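
If you want to convince yourself of what nest_by made, you can count the rows in each nested dataframe (the output of nest_by is already rowwise, so nrow(data) is evaluated one row at a time):

hurricanes %>% 
  nest_by(period) %>% 
  mutate(n_years = nrow(data))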

hurricanes %>% nest_by(period) %>% 
  rowwise() %>% 
  mutate(sim = list(1:3)) %>% 
  unnest(sim) 

This makes a set of simulations for each time period. I showed you just three, instead of 10,000, so you could see how it worked.

Unnesting undoes rowwise, so now we have another rowwise followed by some code that looks very much like what we are used to. The differences are:

  • each bootstrap sample needs to be of hurricane counts for just that time period, which is the column Hurricanes in the cut-down dataframe data each time.
  • we need a separate normal quantile plot for each period, so we glue a facet_wrap onto the end.

2 Nitrofen

Nitrofen is a herbicide that was used extensively for the control of broad-leaved and grass weeds in cereals and rice. Although nitrofen is relatively non-toxic to adult mammals, it can be hazardous to organisms that live in water near where nitrofen is used. One way to assess toxicity is to compare how well animals exposed to different concentrations reproduce (asexually). In our study, 50 Ceriodaphnia dubia were randomly allocated to one of five concentrations of nitrofen, and the number of live offspring from each animal in three breeding periods (labelled brood1 through brood3) was recorded. The other columns in the dataset are animal, which numbers the organisms 1 through 50, and conc, the concentration of nitrofen in micrograms per litre.

The data are in http://ritsokiguess.site/datafiles/nitrofen.csv.

(a) Read in and display (some of) the data.

Nothing ever changes here:

my_url <- "http://ritsokiguess.site/datafiles/nitrofen.csv"
nitrofen <- read_csv(my_url)
Rows: 50 Columns: 5
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
dbl (5): animal, conc, brood1, brood2, brood3

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
nitrofen

50 rows (50 animals) and the columns promised.

Extra: I had to sort of construct the data description from various sources. If there is anything that makes no biological sense, let me know. Biology was never my strong point.

(b) (4 points) Arrange the data so that the numbers of offspring are all in one column, labelled by which brood they are from. Save your rearranged dataframe and display it.

This is a standard pivot_longer. The inputs are the columns to make longer (the ones whose names start with brood), the new column to put the names in, and the new column to put the values into. Break the code lines where you wish, as long as each one is clearly incomplete. The new column names don’t exist yet, so they need to be in quotes; any vaguely sensible names for the new columns (that say what they contain) are good:

nitrofen %>% pivot_longer(starts_with("brood"), 
                          names_to = "brood", 
                          values_to = "offspring") -> nitrofen_long
nitrofen_long

Give the “longer” dataframe whatever name you like. The other way of specifying the columns to make longer is brood1:brood3, since they are consecutive columns.
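
That version would look like this, with the same result:

nitrofen %>% pivot_longer(brood1:brood3, 
                          names_to = "brood", 
                          values_to = "offspring")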

(c) (3 points) Each animal had a number of offspring counted at three separate times. One way of plotting such data, a so-called “spaghetti plot”, is to plot the number of offspring against brood number, with the points for the same animal joined by lines, and the points and lines coloured by the concentration of nitrofen. (Hints: the graph will be easier to read if you use factor(conc); to join the points from the same animal by a line, use group.)

Following the hints (this is not as easy as the graph in the worksheet), you need factor(conc) as the colour and animal as the group:

ggplot(nitrofen_long, aes(x = brood, y = offspring, colour = factor(conc), group = animal)) +
  geom_point() + geom_line()

The reason this one is more challenging is that the colouring is used for a different thing than joining the points with lines, so you cannot use colour for both. group is a general-purpose way of grouping things together that can be used when colour is already taken.

If you forget the factor, this happens:

ggplot(nitrofen_long, aes(x = brood, y = offspring, colour = conc, group = animal)) +
  geom_point() + geom_line()

I’m never a great fan of these, but you can argue that it works (and may even make the next part easier to answer): the highest concentrations are the lightest shade of blue (at the bottom) and the lowest concentrations are the darkest blue (at the top). ggplot tells the difference by seeing whether the input to colour is categorical (which gives you the different colours) or quantitative (shades of blue). Usually colour will be categorical, but this is what happens if it isn’t.
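
If you are ever unsure which treatment you will get, check what R thinks your colour variable is. A numeric variable gets the continuous shaded scale, and a factor gets distinct colours:

class(nitrofen_long$conc)         # "numeric": shades of blue
class(factor(nitrofen_long$conc)) # "factor": distinct colours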

Extra: If you find these colours difficult to distinguish, you can use a different “palette” or collection of colours, for example this one (ideas taken from here):

ggplot(nitrofen_long, aes(x = brood, y = offspring, colour = factor(conc), group = animal)) +
  geom_point() + geom_line() + scale_color_brewer(palette = "Set1")

These are meant to be a set of five (here) colours that are easy to distinguish from each other. (You can decide whether that is true for you.)

The thinking is supposed to be that you use one of the Set palettes with all different colours if you are genuinely distinguishing categories. If you have something that is actually on a continuous scale,8 the story is that you are supposed to use one of the palettes that has one colour at one end and a different colour at the other end, such as this one:

ggplot(nitrofen_long, aes(x = brood, y = offspring, colour = factor(conc), group = animal)) +
  geom_point() + geom_line() + scale_color_brewer(palette = "RdYlGn")

This one starts at red at the low end, and goes via yellow (in the middle) to green at the high end. You might find that this helps you answer the next part more easily (which I explain there).

Another way is to build a colour palette yourself. For example, if you are colour-blind, there might be some colours that you have trouble distinguishing. Colours are specified with hexadecimal (base 16) digits, the first two being the level of red, the second two the level of green, and the last two the level of blue. Here is a colour-blind-friendly palette (so I am told):

cbbPalette <- c("#000000", "#E69F00", "#56B4E9", "#009E73", "#F0E442", "#0072B2", "#D55E00", "#CC79A7")

When you copy this into R Studio, you see the actual colours these represent as the background of each piece of text. In my solutions, you don’t, but they are respectively: black, orange, light blue, green, yellow, dark blue, red, and lilac. The three two-digit numbers in each colour tell you how they were made, for example this green, 00 9E 73, has no red, some green, and a bit of blue.9
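
If you want to see the red, green, and blue levels as ordinary numbers, base R’s col2rgb() does the conversion from hexadecimal for you. For example, for that green:

col2rgb("#009E73")

which shows no red (0), a fair amount of green (158), and a bit of blue (115).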

Here is how you use it to make our graph. ggplot uses only the first five colours, since there were only five concentrations:

ggplot(nitrofen_long, aes(x = brood, y = offspring, colour = factor(conc), group = animal)) +
  geom_point() + geom_line() + scale_color_manual(values = cbbPalette)

I find those pretty clearly distinguishable (even though I am not colour-blind).

As you see, there is more to drawing graphs than simply using the colours you get, and there are a lot of design decisions that go into making a good graph.

(d) (2 points) What does your graph tell you about the effect of the concentration of nitrofen on the reproductive success of Ceriodaphnia dubia? Explain briefly.

The animals that had the most offspring are at the top of the graph. These are all traces coloured red or yellow (or brown if you think it’s brown) on my original graph, which go with low concentrations of nitrofen. On the other hand, the animals that had the fewest offspring, at the bottom of the graph, are on traces coloured blue and pink, which go with the highest concentrations of nitrofen. That is to say, the higher the concentration, the smaller the number of offspring, and so nitrofen is having a negative effect on the population of this animal.

Extra 1: If you changed the colour palette to, for example, this one:

ggplot(nitrofen_long, aes(x = brood, y = offspring, colour = factor(conc), group = animal)) +
  geom_point() + geom_line() + scale_color_brewer(palette = "RdYlGn")

you might find it easier to use your graph to answer this question, because the red and orange (low) values are clearly at the top of the graph, and the green (high) values are clearly at the bottom. The same applies if you did your graph with conc rather than factor(conc) and got the graph with shades of blue.

Extra 2: as a result of studies like this one, nitrofen was one of the first pesticides to be withdrawn from commercial use in the US.

Extra 3: I don’t think it actually helped us much to see the pattern in the number of offspring over time. The original dataset had a column called total (the sum of brood1 through brood3) which would probably have told us the same story as above. But we did learn that the number of offspring in brood1 was always pretty small (and that it went up with time at low concentrations and was flat or decreasing at higher concentrations), along with how to handle a dataset like this and how to make a new type of graph.
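
The total column is not in our file, but it is easy to reconstruct and plot if you want to check that it tells the same story (a sketch):

nitrofen %>% 
  mutate(total = brood1 + brood2 + brood3) %>% 
  ggplot(aes(x = factor(conc), y = total)) + geom_boxplot()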

This one looks a bit like matched pairs, in that we have more than one observation per animal (at different times), but we cannot do the usual matched-pairs thing of taking differences because there are more than two observations per animal. We really do, for an analysis, have to handle the fact that we have trios of probably-correlated observations (because they came from the same animal). In D29, we learn about “repeated measures ANOVA” that handles this kind of thing properly.

Footnotes

  1. If you stop to think about it, you’ll realize that there are 101 years in this timespan, rather than 100.

  2. The code looks odd because of my dataframe being called hurricanes and the column of hurricane counts being called Hurricanes, but this is right and it works.

  3. Remember that for this procedure, the explanatory variable must be a genuine factor rather than a categorical variable that is text, which means you usually have to make it one.

  4. Or whatever it is you learned there.

  5. This is the same idea as for a two-sample t-test; you need both or all of the groups to be normal enough, so as soon as one fails, the whole thing fails.

  6. You might be logically concerned about only testing one group, but the logic is: if the worst group fails, then at least one group fails, and sufficient normality fails. If the worst group looks normal enough, then all the groups must look normal enough (because I picked the worst one), and therefore normality is OK.

  7. Which retroactively confirms what I said before about outliers being worse than skewness.

  8. An “ordinal variable”, if you like that term. This could be values of a quantitative variable, or it could be levels of a categorical variable that have an order to them, like “low, medium, high”.

  9. Remember that we are mixing light here, not paint, so that the colour F0E442, which is mostly red and green, comes out yellow rather than brown as it would with paint.