STAC32 Assignment 1

Packages

library(tidyverse)

You are expected to complete this assignment on your own: that is, you may discuss general ideas with others, but the writeup of the work must be entirely your own. If your assignment is unreasonably similar to that of another student, you can expect to be asked to explain yourself.

If you run into problems on this assignment, it is up to you to figure out what to do. The only exception is if it is impossible for you to complete this assignment, for example a data file cannot be read. (There is a difference between you not knowing how to do something, which you have to figure out, and something being impossible, which you are allowed to contact me about.)

You must hand in a rendered document that shows your code, the output that the code produces, and your answers to the questions. This should be a file with .html on the end of its name. There is no credit for handing in your unrendered document (ending in .qmd), because the grader cannot then see whether the code in it runs properly. After you have handed in your file, you should be able to see (in Attempts) what file you handed in, and you should make a habit of checking that you did indeed hand in what you intended to, and that it displays as you expect.

Hint: render your document frequently, and solve any problems as they come up, rather than trying to do so at the end (when you may be close to the due date). If your document will not successfully render, it is because of an error in your code that you will have to find and fix. The error message will tell you where the problem is, but it is up to you to sort out what the problem is.

Reading in data

The questions below ask you to read in and display the data described:

(3 points) The data in http://ritsokiguess.site/datafiles/bellpepper.csv come from a study of soil water content of two fields A and B where bell peppers are grown. Read the data into a dataframe and display at least some of it.

The easiest one to begin with. This is a .csv file, which you can tell either from the filename, or by opening it in a web browser. The file will probably open in your spreadsheet application, which also gives you a hint that it came from a spreadsheet originally (so it might be a .csv file). Hence, read_csv:

my_url <- "http://ritsokiguess.site/datafiles/bellpepper.csv"
soil <- read_csv(my_url)

Rows: 30 Columns: 2
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (1): field
dbl (1): water

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

soil

Give your dataframe a descriptive name, like soil. Something like my_data is a bad name, because you (and your reader) will soon forget what it contains.¹

You should see a message saying how many rows and columns got read in (in this case 30 rows and 2 columns), along with a description of what the columns were (a text column called field and a quantitative column called water in this case).

The first thing to do after reading in a data file is to look at the dataframe, to make sure the values in it look sensible.² The easiest way to look at your dataframe is simply to put its name on a line by itself in your code chunk. This will display the first ten rows. If you have the df-print: paged thing working properly, your rendered document will then allow your reader to page through your data values to check them. In this case, the field B values start in row 15, so you’ll need to page down to check that they are indeed there.

Another way you might have seen in the course is this one:

glimpse(soil)

Rows: 30
Columns: 2
$ field <chr> "a", "a", "a", "a", "a", "a", "a", "a", "a", "a", "a", "a", "a",…
$ water <dbl> 10.2, 10.7, 15.5, 10.4, 9.9, 10.0, 16.6, 15.1, 15.2, 13.8, 14.1,…

which “transposes” the arrangement so that the variables go down the page and their values go across. Either of these are good here, though just typing the name of the dataframe is usually easier.

Extra: for yourself, you would normally look at your dataframe to check that the values appear to be correct. If you opened the file in your web browser (which may have actually opened it in a spreadsheet), you can physically read through the values to verify that they are correct.

(3 points) The file http://ritsokiguess.site/datafiles/d1.txt contains values for two quantitative variables x and y. Read the data into a dataframe d1, and display the dataframe. (Hint: look at the file with your web browser first.)

This time, the .txt on the filename suggests that we have some work to do to figure out what kind of text file we have. If you look at the file, you’ll see that the numeric values of x and y, as well as the variable names themselves, are separated by a colon, so read_delim is called for:

my_url <- "http://ritsokiguess.site/datafiles/d1.txt"
d1 <- read_delim(my_url, ":")

Rows: 5 Columns: 2
── Column specification ────────────────────────────────────────────────────────
Delimiter: ":"
dbl (2): x, y

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

d1

I told you what to call the dataframe this time. Don’t forget the second input to read_delim, which says what the data values are separated by. This one is a deliberately small dataframe, so that all of it gets displayed.

Extra: now that you have gotten read_delim working, you might be wondering what happens if you leave the delimiter out:³

read_delim(my_url)

Rows: 5 Columns: 2
── Column specification ────────────────────────────────────────────────────────
Delimiter: ":"
dbl (2): x, y

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

It actually still works! What happens is that if you don’t supply a delimiter, read_delim tries to guess what it is (since there must be one). In this case, read_delim guessed right (as you can see under “Column specification”) and it worked. However, you won’t always be so lucky.

The safe way is to explicitly say what the delimiter is. Hence, if you choose to omit the delimiter in this course, you must convince your reader that the delimiter was actually gotten correct (otherwise, it looks as if you don’t understand what you are doing).

(3 points) The file http://ritsokiguess.site/datafiles/d2.txt contains values for a categorical variable group and a quantitative variable y. Read the data into a dataframe d2, and display the dataframe.

The hint for the previous question should suggest to you that once again, looking at the datafile in your web browser should give you a hint about how to read it in. This time, there are two aligned columns separated by a variable amount of whitespace, which is the sort of thing read_table deals with. You can copy and paste your code from the previous question, but make sure you edit everything that needs editing!

my_url <- "http://ritsokiguess.site/datafiles/d2.txt"
d2 <- read_table(my_url)


── Column specification ────────────────────────────────────────────────────────
cols(
  group = col_character(),
  y = col_double()
)

d2

Island birds

The data in http://ritsokiguess.site/datafiles/Sleuth2_ex1027.csv are information about 62 species of European birds. For each of 16 islands over many years, observers noted whether what species were present on that island, and for each species present, how many breeding pairs of that species there were. The data in that file are averages over the sixteen islands for that species. In addition, a “time to extinction” was also calculated for each species, based on the length of time the species had been seen on each island. A larger time to extinction is better, and it is believed that a species with more breeding pairs will have a larger time to extinction.

As well as the columns Pairs and Time described above, the dataset contains these columns:

Species: the name of the bird species
the Size of birds in that species: large (L) or small (S)
the Status of the bird species: resident (R), or migratory (M). Resident birds live on the same island all year, while migratory birds spend part of the year in some other part of the world.

(1 point) Read in and display (some of) the data.

A .csv file, so:

my_url <- "http://ritsokiguess.site/datafiles/Sleuth2_ex1027.csv"
birds <- read_csv(my_url)

Rows: 62 Columns: 5
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (3): Species, Size, Status
dbl (2): Time, Pairs

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

birds

as ever, giving your dataframe a suitable name, which doesn’t have to be the same as mine, but should be descriptive of what the dataframe contains.

Data source: ex1027 from the Sleuth2 package.

(1 point) Below, you will be drawing some graphs. Why is it that the column Species does not feature in any of the graphs?

Species is an identifier variable: that is to say, it identifies the observations, and is different for each observation, so that it contains nothing to summarize.

Aside: In the old days of R (that is, before the tidyverse), dataframes often had what was called “row names”, and an identifier variable was the sort of thing that would be used for row names. Row names were sort of a column and sort-of not a column, and the people behind the tidyverse didn’t like this kind of ambiguity, so there are no row names in the tidyverse: they are (as here) normal columns that you have to consider how you want to treat them.

For each of the questions below, draw an appropriate graph of (only) the variables named, explaining (very) briefly why it is appropriate, and give a one-sentence interpretation of the most important feature of the graph, in the context of the data. Two points for the graph in each case (justified choice of graph and drawing the graph), and one for interpretation.

(3 points) Status

To decide what kind of graph to draw, figure out whether the variables concerned are categorical or quantitative, and how many of each you have. In this case, Status is categorical, so you need a bar chart (that counts how many time each Status appears):

ggplot(birds, aes(x = Status)) + geom_bar()

There are more resident bird species than migratory ones (optionally, about 42 resident ones and about 19 migratory).

Here and below, your interpretation needs to talk about what the variables actually represent (not just the name of the variable). This is because you are writing for a reader who is likely interested in birds more than statistics, and the variable names are a convenience for you, the statistician.

(3 points) Pairs

One quantitative variable, so a histogram. Choose an appropriate number of bins: enough to show the shape, but not so many that you lose a clear picture of the shape. I think 7 or 8 is about right. If the grader thinks you have too few or too many bins (based on how your graph looks), expect to earn less than full credit:

ggplot(birds, aes(x = Pairs)) + geom_histogram(bins = 8)

There are a few species that are very common (have a lot of breeding pairs). Or, most of the species have less than about 5 breeding pairs.

If you use a term like “skewed to the right”, explain what that means in this context for the benefit of your readers (who, in this case, would be ornithologists rather than statisticians), along the lines of one of the two interpretations I gave above (talking about the long tail of the distribution or its body, respectively).

(3 points) Pairs and Size

Pairs is quantitative and Size is categorical, so a boxplot:

ggplot(birds, aes(x = Size, y = Pairs)) + geom_boxplot()

Large birds tend to have (slightly) more breeding pairs on average than small ones, but the distribution for small birds has greater variability (and an upper outlier).

The main thing on a boxplot is to compare the medians, so make sure you do that.

(3 points) Pairs and extinction time

These variables are both quantitative, so a scatterplot. The latter is response (read the question) so goes on the \(y\)-axis:

ggplot(birds, aes(x = Pairs, y = Time)) + geom_point()

If the number of breeding pairs is larger, then the extinction time is (usually) larger as well.

(A scatterplot is designed to investigate the relationship between two quantitative variables, so make sure you say something about how they are related here. Optionally, add a geom_smooth() to your graph to help with this assessment.)

(3 points) Size and status

Two categorical variables, so a grouped bar chart. Remember the position = "dodge" to put the bars side by side rather than stacked. You’ll have to make a decision about which variable is x and which is fill. You can justifiably do it either way around here, but your choice affects your interpretation:

ggplot(birds, aes(x = Size, fill = Status)) + geom_bar(position = "dodge")

Most of the bird species are resident rather than migratory, whether they are large or small. Or, if the species is large, it is likely to be resident, and if it is small, it is also more likely to be resident, and to about the same degree.

(The logic is “if thing on \(x\)-axis, then thing in fill.)

Or, put the variables the other way around:

ggplot(birds, aes(x = Status, fill = Size)) + geom_bar(position = "dodge")

A slightly larger number of the bird species are small rather than large, and this is true whether the species is migratory or resident. (If the species is migratory, it is slightly more likely to be small, and the same is true if the species is resident.)

You could alternatively see this as “too many categorical variables”, and repeat your bar chart but with facets:

ggplot(birds, aes(x = Status)) + geom_bar() +
  facet_wrap(~ Size)

with the same interpretation as my (first) grouped bar chart. Or, again, with x and the facetting variable switched around.

Extra 1: the time a plot like this is interesting⁴ is if there is an association between the status of a bird species and its size: for example, if smaller birds also tend to be migratory. That would show up on this kind of graph by one red bar being taller than the blue one, and the other red bar being shorter.

Extra 2: I probably mentioned that I don’t like stacked bars, but I think there is one defensible case: when the bars are scaled to have the same height. This is what position = "fill" does:

ggplot(birds, aes(x = Size, fill = Status)) +        
  geom_bar(position = "fill")

and this provides enough detail to show that out of the large species, slightly more of them are resident than the small species. But the difference is very small, which is why it was hard to see on the side-by-side bars.

(3 points) Extinction time, pairs, and size

Two quantitative variables and one categorical one, so a scatterplot with Size indicated by colour:

ggplot(birds, aes(x = Pairs, y = Time, colour = Size)) +
  geom_point()

Large species tend to have a longer extinction time than small ones, regardless of the number of breeding pairs. (This on the basis that the red dots are usually above the blue ones.)

Again, you can optionally add a geom_smooth to your graph:

ggplot(birds, aes(x = Pairs, y = Time, colour = Size)) +
  geom_point() + geom_smooth()

`geom_smooth()` using method = 'loess' and formula = 'y ~ x'

The red and blue smooth trends are mostly upward, and the red one is above the blue one all the way along.

Footnotes

I am happy with using a disposable name like my_url for the URL of a file, since this is generally used once to read in the data from the file and then never used again. A dataframe, however, is read in from the file and typically used several more times after that, for example to draw a graph or to do an analysis. If re-using my_url bothers you, by all means call this one something like url_soil and the later ones url_d1 and url_d2, for example.↩︎
“Garbage in, garbage out” very much applies to data analysis: if your data are wrong, your entire analysis will be nonsense.↩︎
This time, I didn’t save the dataframe in anything, because all I wanted to do was to see whether the reading-in worked, not do anything with the dataframe I read in.↩︎
This one is rather dull.↩︎