STAC33 Assignment 2

You are expected to complete this assignment on your own: that is, you may discuss general ideas with others, but the writeup of the work must be entirely your own. If your assignment is unreasonably similar to that of another student, you can expect to be asked to explain yourself.

If you run into problems on this assignment, it is up to you to figure out what to do. The only exception is if it is impossible for you to complete this assignment, for example a data file cannot be read. (There is a difference between you not knowing how to do something, which you have to figure out, and something being impossible, which you are allowed to contact me about.)

You must hand in a rendered document that shows your code, the output that the code produces, and your answers to the questions. This should be a file with .html on the end of its name. There is no credit for handing in your unrendered document (ending in .qmd), because the grader cannot then see whether the code in it runs properly. After you have handed in your file, you should be able to see (in Attempts) what file you handed in, and you should make a habit of checking that you did indeed hand in what you intended to, and that it displays as you expect.

Hint: render your document frequently, and solve any problems as they come up, rather than trying to do so at the end (when you may be close to the due date). If your document will not successfully render, it is because of an error in your code that you will have to find and fix. The error message will tell you where the problem is, but it is up to you to sort out what the problem is.

1 Hurricanes

The number of hurricanes making landfall on the east coast of the US was recorded each year from 1914 to 2014. The “hurricane season” is from June 1 to November 30 each year. The data are recorded in the file http://ritsokiguess.site/datafiles/hurricanes.csv. There are three columns: the year (Year), the number of hurricanes that year (Hurricanes), and period, in which the years are divided up into periods of about 25 years each.

(a) (2 points) Read in and display (some of) the data.

library(tidyverse)
my_url <- "http://ritsokiguess.site/datafiles/hurricanes.csv"
hurricanes <- read_csv(my_url)
Rows: 101 Columns: 3
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (1): period
dbl (2): Year, Hurricanes

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
hurricanes

The three columns as promised, along with one row per year.1

Note that I called my dataframe hurricanes with a lowercase h, but the number of hurricanes in it for each year is Hurricanes with an uppercase H. R is case-sensitive, so it can tell whether you mean the dataframe or the column in it. Having said that, though, you might prefer to call the dataframe hurricane_counts or something like that.2 Just make sure your name describes in some way what the dataframe contains.
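
To see the case-sensitivity point in isolation, here is a minimal sketch. The small dataframe below is a toy stand-in, not the real data: it has the same two names, but the counts are deliberately left missing because only the names matter here.

```r
# Toy stand-in for the real data (an assumption for illustration):
# the counts are left missing on purpose, since only the names matter
hurricanes <- data.frame(Year = 1914:1916, Hurricanes = NA_real_)

# The dataframe is hurricanes (lowercase h); its column is Hurricanes (uppercase H)
names(hurricanes)

# Is there a column with the uppercase name? With the lowercase name?
has_upper <- "Hurricanes" %in% names(hurricanes)
has_lower <- "hurricanes" %in% names(hurricanes)
c(uppercase = has_upper, lowercase = has_lower)
```

Only the uppercase name matches a column; the lowercase name belongs to the dataframe itself, so R treats the two as different things.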

Extra: I did a bit of editing to get the data into this form. The original data came from a textbook by five authors all called Lock (and all related!):

library(Lock5Data)
data("Hurricanes2014")
Hurricanes2014

I divided time into 4 parts of 25 years each (more or less):

Hurricanes2014 %>% 
  mutate(period = cut(
    Year, 
    breaks = c(1913, 1939, 1964, 1989, 2015), 
    labels = c("1914 to 1939", 
               "1940 to 1964", 
               "1965 to 1989", 
               "1990 to 2014"))) -> hurricanes
hurricanes

cut takes something that is numerical (Year) and makes it into categories according to its value: breaks says where one category ends and the next begins, and labels gives the categories their names. I wanted to define four categories, so I had to supply five breakpoints (why?), one below all the data values, one above all of them, and three within the data (why?) to make four groups. The definition of the breaks and labels was rather long, so I split the code up into several lines to make it easier to read.3 (An alternative would have been to save the breaks and labels in variables first, and then I could have done the cut on one line.)
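
The alternative of saving the breaks and labels first might look something like the sketch below. The names year_breaks, period_labels, years_only, and with_period are my own choices, and years_only is a stand-in for Hurricanes2014 that contains only the years (no hurricane counts), so that the sketch runs without the Lock5Data package.

```r
library(dplyr)

# Breakpoints and labels saved in variables first (hypothetical names)
year_breaks <- c(1913, 1939, 1964, 1989, 2015)
period_labels <- c("1914 to 1939", "1940 to 1964", "1965 to 1989", "1990 to 2014")

# Stand-in for Hurricanes2014: the years only, with no hurricane counts
years_only <- tibble(Year = 1914:2014)

# With the long pieces named, the cut fits on one line
years_only %>% mutate(period = cut(Year, breaks = year_breaks, labels = period_labels)) -> with_period

# How many years fall in each period?
with_period %>% count(period)
```

The count shows 26 years in the first period and 25 in each of the others, which is the "more or less" in "25 years each".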

The intervals as defined by cut are what a mathematician would call “half-open”: the interval excludes the lower value and includes the upper one. Hence 1939 is in the interval starting above 1913, and not in the interval ending in 1964.
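
One way to check the half-open behaviour is to run cut on a few years at or next to the breakpoints. This is a small sketch using only base R; edge_years and the other names are my own, and I left out labels so that cut shows its own interval notation (dig.lab = 4 asks it to print the years in full).

```r
# The same breakpoints as in the code that made the periods
year_breaks <- c(1913, 1939, 1964, 1989, 2015)

# Years at or next to the breakpoints
edge_years <- c(1914, 1939, 1940, 1964, 1965, 2014)

# Without labels, cut names each interval itself: round bracket = excluded,
# square bracket = included
edge_intervals <- cut(edge_years, breaks = year_breaks, dig.lab = 4)
data.frame(Year = edge_years, interval = edge_intervals)

# A year equal to the lowest breakpoint is in no interval at all, so it comes out missing
below_all <- cut(1913, breaks = year_breaks, dig.lab = 4)
below_all
```

This is why the lowest breakpoint has to be below the first year in the data rather than equal to it: a year of 1913 would have ended up with a missing period.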

This is the dataframe I saved for you.

(b) (3 points) Make a suitable plot of the number of hurricanes every year. It is customary on a time plot to join the points with lines.

Two quantitative variables, so a scatterplot. By tradition, time goes on the \(x\)-axis:

ggplot(hurricanes, aes(x = Year, y = Hurricanes)) + geom_point() + geom_line()