library(tidyverse)
STAC33 Assignment 2
You are expected to complete this assignment on your own: that is, you may discuss general ideas with others, but the writeup of the work must be entirely your own. If your assignment is unreasonably similar to that of another student, you can expect to be asked to explain yourself.
If you run into problems on this assignment, it is up to you to figure out what to do. The only exception is if it is impossible for you to complete this assignment, for example a data file cannot be read. (There is a difference between you not knowing how to do something, which you have to figure out, and something being impossible, which you are allowed to contact me about.)
You must hand in a rendered document that shows your code, the output that the code produces, and your answers to the questions. This should be a file with .html
on the end of its name. There is no credit for handing in your unrendered document (ending in .qmd
), because the grader cannot then see whether the code in it runs properly. After you have handed in your file, you should be able to see (in Attempts) what file you handed in, and you should make a habit of checking that you did indeed hand in what you intended to, and that it displays as you expect.
Hint: render your document frequently, and solve any problems as they come up, rather than trying to do so at the end (when you may be close to the due date). If your document will not successfully render, it is because of an error in your code that you will have to find and fix. The error message will tell you where the problem is, but it is up to you to sort out what the problem is.
1 Hurricanes
The number of hurricanes making landfall on the east coast of the US was recorded each year from 1904 to 2014. The “hurricane season” is from June 1 to November 30 each year. The data are recorded in the file http://ritsokiguess.site/datafiles/hurricanes.csv. There are three columns: the year, the number of hurricanes, and period
, in which the years are divided up into 25-year periods.
(a) (2 points) Read in and display (some of) the data.
<- "http://ritsokiguess.site/datafiles/hurricanes.csv"
my_url <- read_csv(my_url) hurricanes
Rows: 101 Columns: 3
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (1): period
dbl (2): Year, Hurricanes
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
hurricanes
The three columns as promised, along with one row per year.1
Note that I called my dataframe hurricanes
with a lowercase h
, but the number of hurricanes in it for each year is Hurricanes
with an uppercase H
. R is case-sensitive, so it can tell whether you mean the dataframe or the column in it. Having said that, though, you might prefer to call the dataframe hurricane_counts
or something like that.2 Just make sure your name describes in some way what the dataframe contains.
Extra: I did a bit of editing to get the data into this form. The original data came from a textbook by five authors all called Lock (and all related!):
library(Lock5Data)
data("Hurricanes2014")
Hurricanes2014
I divided time into 4 parts of 25 years each (more or less):
%>%
Hurricanes2014 mutate(period = cut(
Year, breaks = c(1913, 1939, 1964, 1989, 2015),
labels = c("1914 to 1939",
"1940 to 1964",
"1965 to 1989",
"1990 to 2014"))) -> hurricanes
hurricanes
cut
takes something that is numerical (Year
) and makes it into categories according to its value, as described in the labels
line. I wanted to define four categories, so I had to supply five breakpoints (why?), one below all the data values, one above all of them, and three within the data (why?) to make four groups. The definition of the breaks and labels was rather long, so I split the code up into several lines to make it easier to read.3 (An alternative would have been to define the breaks and labels into variables first, and then I could have done it on one line.)
The intervals as defined by cut
are what a mathematician would call “half-open”: the interval excludes the lower value and includes the upper one. Hence 1939 is in the interval starting above 1913, and not in the interval ending in 1964.
This is the dataframe I saved for you.
(b) (3 points) Make a suitable plot of the number of hurricanes every year. It is customary on a time plot to join the points with lines.
Two quantitative variables, so a scatterplot. By tradition, time goes on the \(x\)-axis:
ggplot(hurricanes, aes(x = Year, y = Hurricanes)) + geom_point() + geom_line()
Extra: You will often see a kind of bar chart for data like this. It’s not a real bar chart (and thus ggplot
does it differently), because our dataframe does not have a row for each hurricane, just a summary of how many there were each year.4 Another way to say it is that we already have the counts (that we want to feed to ggplot
) rather than having ggplot
count them for us. This is done using geom_col
rather than geom_bar
, which takes a y
that contains the counts:
ggplot(hurricanes, aes(x = Year, y = Hurricanes)) + geom_col()
This is schematically the same idea as the scatterplot with points joined by lines (the points are the tops of these bars), but there are a lot of years, so there are a lot of bars, so there is a lot of “junk” attached to this plot. I think the points joined by lines is a lot clearer. (Business people like this kind of bar chart.)
(c) (2 points) Comment briefly on any time trends you see in the plot you just made.
Decide whether you see any indication of an increase or decrease (or anything else) over time, and say how you know.
My take is that there is a slight increase over time in average number of hurricanes per year: for example, most of the years with a small number of hurricanes were near the beginning, and most of the years with a large number of hurricanes are more recently.
You can also say that there is not much of a trend, which you can support by saying that there is a lot of variability from one year to the next. I don’t mind which way you go, as long as you say what it is that supports your call.
Extra: in environmental science, the standard way of testing for an increase or decrease over time is called the “Mann-Kendall correlation”. You may have heard of the Kendall correlation as a way of assessing the correlation between two variables without assuming a straight-line relationship (as the regular Pearson correlation does). When one of the variables is time, as here, the name Mann gets attached to the correlation. It can be difficult to look at a time trend like the one we have and decide whether this is more upward than chance, and environmental-science data often tends to look something like the hurricane counts here.
I wrote a package called mkac
, which tests the Mann-Kendall correlation for significance. If you are interested, you can install it thus (uncomment the line of code):
# install.packages("mkac", repos = "nxskok.r-universe.dev")
Then feed kendall_Z_adjusted
the column of hurricane counts:
library(mkac)
with(hurricanes, kendall_Z_adjusted(Hurricanes))
$z
[1] 2.869652
$z_star
[1] 3.479851
$ratio
[1] 0.6800443
$P_value
[1] 0.004109237
$P_value_adj
[1] 0.0005016929
The P-value you want is the bottom one, 0.0005.5 This is very small, so there is definitely more of a trend than you would expect by chance. (Hence, the “right answer” above is that there is a time trend, but my interest above was in whether you could support your decision about a time trend, whatever it was.)
In environmental science, the standard way to assess how big a time trend is is to calculate the “Theil-Sen slope”, a slope that is less affected by outliers than the standard regression slope is (and environmental-science data tends to have a lot of outliers):
<- with(hurricanes, theil_sen_slope(Hurricanes))
tss tss
[1] 0.02439024
That may not look like much, but this is per year, and we have 100 years of data, so the increase in average numbers of hurricanes over the span of our data is
100 * tss
[1] 2.439024
or over 2 hurricanes per year more now than 100 years ago, when it is unusual to have more than 10 hurricanes in an entire year.
Environmental science can be very depressing sometimes.
(d) (3 points) Make a suitable plot of the number of hurricanes each year by time period.
Time period is categorical, and the number of hurricanes is quantitative, so a boxplot:
ggplot(hurricanes, aes(x = period, y = Hurricanes)) + geom_boxplot()
I had to be careful with the question wording here, because I did not want you to count up the total number of hurricanes in each time period and then plot those:
%>%
hurricanes group_by(period) %>%
summarize(total_hurricanes = sum(Hurricanes)) %>%
ggplot(aes(x = period, y = total_hurricanes)) + geom_col()
Note the use of geom_col
again, because after doing the summarize
, we have one number for each time period. But the boxplot gives a more comprehensive picture, including the variability in numbers of hurricanes within each time period.
You could also, I guess, plot histograms (vertically) next to each other using facets:
ggplot(hurricanes, aes(x = Hurricanes)) + geom_histogram(bins = 8) +
facet_wrap(~ period, ncol = 1)
This doesn’t give the easy comparisons of centre and spread that the boxplots do, and so is not as good a plot as a boxplot.
(e) (2 points) Comment on the plot you just made. (Think about centre, spread, and shape.)
Let’s grab the boxplot again so that you can see what I’m talking about:
ggplot(hurricanes, aes(x = period, y = Hurricanes)) + geom_boxplot()
To guide yourself commenting on boxplots, think about centre, then think about spread and shape (these two are often intertwined).
There seems to have been on average a higher number of hurricanes in recent years and a lower number in the first time period. This may indicate an upward trend; the middle two time periods are (in terms of median) the wrong way around, but they might not be significantly different.
As for spread, the two distributions with smallest IQR (the first one and the third one) also have smallest median, and the other two have bigger IQR as well as bigger median. It can be tricky to think about spread when you have outliers, but the outliers here are quite far out, so you might think of those first and third distributions as being not very spread out most of the time, but occasionally you get an isolated value that is very different (much bigger).
All of the distributions are skewed to the right in shape: either they have longer upper tails, or they have upper outliers: a very big value is much more likely than a very small one.
Extra: Counts of things often have something like a Poisson distribution. This distribution is skewed to the right, and the variance of a Poisson distribution is the same as its mean, which would mean that a time period with more hurricanes (on average) would be expected to have a larger spread of hurricane counts, which seems to be what has happened here. Counts have a lower limit of 0, so you might expect them to be skewed away from zero (that is, to the right).
(f) (3 points) Find the median and interquartile range of number of hurricanes in each time period.
This is a group-by and summarize. Give the summaries names that indicate (or at least suggest) what they are median and IQR of:
%>%
hurricanes group_by(period) %>%
summarize(med_canes = median(Hurricanes), iqr_canes = IQR(Hurricanes))
I want the interquartile range rather than the quartiles themselves, but if you forget about IQR
, you could work out the quartiles and then compute the IQR:
%>%
hurricanes group_by(period) %>%
summarize(Q1_canes = quantile(Hurricanes, 0.25),
Q3_canes = quantile(Hurricanes, 0.75)) %>%
mutate(iqr = Q3_canes - Q1_canes)
quantile
comes from the “computing summaries” lecture, but mutate
comes from the “doing things with dataframes” lecture, which you may have seen by the time you are working on this.
Extra: computing the quartiles actually makes the next part easier, because you can then be sure that you are reading the quartiles correctly from the boxplot.
(g) (2 points) How do you know, from your earlier work, that the median and IQR for the 1990–2014 period are correct?
Look at your boxplot, since the boxplot shows the medians and, via the heights of the boxes, the interquartile range. I am repeating my boxplot here:
ggplot(hurricanes, aes(x = period, y = Hurricanes)) + geom_boxplot()
For 1990–2014, the median appears to be 7 (as it is), and it looks as if Q1 is 4 and Q3 is 9, so the IQR should be \(9 - 4 = 5\). Note that the data are numbers of hurricanes, so they must be integers, and that the thin horizontal line between 5 and 10 is at 7.5, so that “just below it” must be 7 and “above it but nearer 10” must be 9 rather than 8.
Footnotes
If you stop to think about it, you’ll realize that there are 101 years in this timespan, rather than 100.↩︎
It is always safe to use an underscore in a variable name, and I think this is clearer than “camel case”,
hurricaneCounts
. It is also safe in R to use a dot to separate words, as inhurricane.counts
, because a dot doesn’t have the special meaning that it does in Python. The modern way is to use an underscore, but you will see some (mainly) old code that uses a dot, such ast.test
which is coming up in the course very soon.↩︎You can split lines of R code anywhere, as long as the code is obviously incomplete, so that R knows that it has to look on the next line for the rest of it. Here, the commas seemed to be a nice place to break things, but in actual fact, the code isn’t complete until the three brackets are closed on the last
labels
line.↩︎The data might have come to us differently: there might be a dataframe with one row for each hurricane, including things like where it made landfall, how strong it was when it did (there is a numerical scale that hurricane people use), the date and time when it made landfall, and any other notes like the hurricane’s name if it had one. If that had been the case, we might have made a different plot.↩︎
The adjustment is for autocorrelation, a concept you’ll have run into if you have studied time series. Here, the autocorrelation is actually positive, meaning that the numbers of hurricanes each year are not independent, but if the number is high one year, it will tend to be high the next year as well. A positive autocorrelation means that the P-value for the Mann-Kendall correlation should be smaller than it would be if you assumed independence, which is why the adjusted P-value is smaller than the unadjusted one above it.↩︎