STAC32 Assignment 1

You are expected to complete this assignment on your own: that is, you may discuss general ideas with others, but the writeup of the work must be entirely your own. If your assignment is unreasonably similar to that of another student, you can expect to be asked to explain yourself.

If you run into problems on this assignment, it is up to you to figure out what to do. The only exception is if it is impossible for you to complete this assignment, for example a data file cannot be read. (There is a difference between you not knowing how to do something, which you have to figure out, and something being impossible, which you are allowed to contact me about.)

You must hand in a rendered document that shows your code, the output that the code produces, and your answers to the questions. This should be a file with .html on the end of its name. There is no credit for handing in your unrendered document (ending in .qmd), because the grader cannot then see whether the code in it runs properly. After you have handed in your file, you should be able to see (in Attempts) what file you handed in, and you should make a habit of checking that you did indeed hand in what you intended to, and that it displays as you expect.

Hint: render your document frequently, and solve any problems as they come up, rather than trying to do so at the end (when you may be close to the due date). If your document will not successfully render, it is because of an error in your code that you will have to find and fix. The error message will tell you where the problem is, but it is up to you to sort out what the problem is.

1 Body Temperature of a Beaver

Beavers are large semi-aquatic rodents. They live in rivers and lakes, and build dams and lodges using tree branches, rocks, and mud. (Source: Wikipedia.) A study was carried out on beaver body temperature. A beaver’s body temperature was taken every 10 minutes, using a temperature-sensitive radio transmitter. At the same time, the location of the beaver was recorded, and was recorded in activ as no if it was inside the lodge (home), where activity was expected to be low, and yes if it was outside (activity expected to be higher). The data are in http://ritsokiguess.site/datafiles/beaver.csv, with the body temperature (in degrees Celsius) in the column temp.

(a) (3) Read the data directly from the data file into a dataframe, and display some of the dataframe. Hint: “display some of the dataframe” in this course means display 10 rows and as many columns as will fit.

If you don’t already have this at the top of your document, put it there:

library(tidyverse)

and then get to work.

I like to define the URL into a variable with a short name (URLs tend to be long), which you can do by copying and pasting it:

my_url <- "http://ritsokiguess.site/datafiles/beaver.csv"

and then read directly from this URL. This is a .csv file, as you can see from the URL, so use read_csv.1 After that, putting the name you gave the dataframe on a line by itself will display it according to the specifications:

beaver <- read_csv(my_url)
Rows: 100 Columns: 4
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (1): activ
dbl (3): day, time, temp

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
beaver

I called my dataframe beaver. You can use whatever name you like, but it is better to use something that describes what the dataframe contains, so that you don’t get confused later.

To back up a bit: when you use read_csv or any other of the read_ things, you get a little report of what it read in: how many rows and columns, what the columns were separated by (in “Delimiter”: the data values in a .csv file like this one are separated by commas), what the columns are called, and what kind of thing they contain. Here activ is text, and the others are (possibly decimal) numbers.2

(b) (3) Make a suitable graph of the body temperatures (only).

The suitable graph of one quantitative variable is a histogram:

ggplot(beaver, aes(x = temp)) + geom_histogram(bins = 8)

Details:

  • in ggplot, the first thing is the name of your dataframe.

  • the next thing is aes for what to plot. A histogram requires an x, which here is the column temp.

  • geom_histogram. This is the “how to plot it”: we are making a histogram of the thing that was in x in the aes.

  • Use a number of bins that clearly displays the shape of the histogram. You must choose a number of bins because the default, 30, is always (for the kind of data we will see) too many. You might have to try several numbers of bins until you find one that gives a good picture of the shape. I think between about 7 and 12 is good.

To think about choosing the number of bins: there are 100 observations. A starting point for the number of bins is to think about the next higher power of 2, which is \(128 = 2^7\). That gives you \(7+1 = 8\) bins as a starting point. This rule is called Sturges’ rule, but it really only works for bell-shaped distributions. When you have something not bell-shaped (like ours), you might need more bins to get a good picture. This is why I said up to about 12 bins was good.

Extra: This is actually a bimodal shape (two peaks). You’ll see in a moment why that is.

(c) (3) The researchers believe that beavers have a higher body temperature on average when active than when they are not. Make a suitable plot to assess this belief.

One quantitative variable, body temperature, and one categorical, activity, so a boxplot:

ggplot(beaver, aes(x = activ, y = temp)) + geom_boxplot()

A boxplot requires an x and a y, with the x being the categorical variable (displayed on the \(x\) axis of the plot) and y being the quantitative one (on the \(y\) axis). There is nothing that you have to tweak with the boxplot (unlike a histogram’s bins).

Extra: ggplot’s boxplots go up and down. To make them go left and right, you can use coord_flip,3 which interchanges the roles of \(x\) and \(y\) on the graph:

ggplot(beaver, aes(x = activ, y = temp)) + 
  geom_boxplot() + coord_flip()

A technicality: you can spread R code over more than one line as you wish, but when you do, you need to make sure each line (apart from the last one) is incomplete, so that R knows there is more to come. With a ggplot, this is most easily done by putting a + on the end of a line.

(d) (2) Is the researchers’ belief justified, from the evidence in your graph? Explain briefly.

The median body temperature is about 37.9 degrees when the beaver is active and about 37.1 degrees when it is not (reading from the graph). On that basis, the researchers’ belief is justified. One point.

More than that, though, the spread in temperatures while the beaver is active and while it is inactive is small (the boxes are not very tall). This means that there are almost two separate groups of temperatures, one around 38 degrees when the beaver is active, and one around 37 degrees when the beaver is not, and there is very little overlap. (This would lead you to suspect that the beaver’s body temperature is significantly higher when active, in the sense of statistical significance, which we will come back to later.)

The second point is for saying something about spread, or about the groups of measurements almost not overlapping, to get at the idea that the difference of less than 1 degree in median may actually be a meaningful difference, even though on the face of it it is not very big.

Extra: now we are in a position to say where that bimodality on the histogram came from. On the boxplot, there appears to be a group of temperatures close to 38 (active) and another one close to 37 (not active), with, perhaps, less in between. Now go back and look at your histogram: there is one peak near 38 degrees and another one near 37, with a “hole” in between to say that there are not many temperatures near 37.5. That’s why the histogram had two peaks; it was really two distinct groups of temperatures mixed together, each with their own (different) peak.

There is a plot called a “density plot” that is a kind of smoothed-out version of a histogram. Here is what it looks like for all the temperatures taken together:

ggplot(beaver, aes(x = temp)) + geom_density()

The two peaks show up very clearly; indeed, it looks exactly like your histogram would if it were smoothed out. But we can do more: we can do a separate density plot for each group and put them behind each other, distinguishing them by colour. This is done by adding the categorical variable as fill:4

ggplot(beaver, aes(x = temp, fill = activ)) +
  geom_density(alpha = 0.5)

Now you see the reason for the two peaks. One of them is when the beaver is active, and one when it is not. The grey area at around 37.5 is actually a blue piece in front of a red piece, indicating that those few temperatures sometimes go with the beaver being active and sometimes with it not being active.

There are two ways to colour things in ggplot. The one we will see most is colour, which for this kind of graph looks like this (I changed fill to colour and left everything else the same apart from alpha which I no longer need):

ggplot(beaver, aes(x = temp, colour = activ)) +
  geom_density()

The principle is that colour colours the outside of the thing, and fill colours the inside of it. We will mostly be using colour for things like points and lines that don’t really have an inside. I like fill better than colour for these density plots, but there is one technical detail: if you don’t have the alpha in there with the fill, the front plot (the blue one) overwrites the red one, and you can’t see the red bit behind it. Using a value of alpha between 0 and 1 makes the graphs partly transparent, so you can see the back one behind the front one.5

A challenge for you: investigate what happens when you add colour or fill to your boxplots. They apply to a categorical variable, activ in this case.

2 Temperatures in Ann Arbor, Michigan

The data in http://ritsokiguess.site/datafiles/aatemp.txt are annual mean temperatures for the city of Ann Arbor (in Michigan, US) for each year that records have been kept up to (in this dataset) 2000. The temperatures are measured in degrees Fahrenheit and are all positive.

(a) (1) Take a look at the data file in your web browser. What do you notice about how one data value is separated from the next?

The data values are separated by a minus sign (that is, each row of the data file has a year, a minus sign, and then the temperature).

Extra: 50 degrees Fahrenheit is about 10 degrees Celsius, which is the right sort of thing for an annual mean temperature. Ann Arbor has cold winters, so the annual mean is usually a bit less than that.

(b) (3) Read the data from the data file into a dataframe, and display some of the dataframe.

First, this:

library(tidyverse)

Put the URL in a variable (URLs tend to be long), and then read directly from the URL. In the previous part, you said that the data values are separated by minus signs, so the right way to read it in is to use read_delim and that the separator is a minus sign:

my_url <- "http://ritsokiguess.site/datafiles/aatemp.txt"
aatemp <- read_delim(my_url, "-")
Rows: 115 Columns: 2
── Column specification ────────────────────────────────────────────────────────
Delimiter: "-"
dbl (2): year, temp

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
aatemp

Type the name of the dataframe to display some of it as requested.

Note again that if your reading-in worked, you will get a message that tells you that you have two quantitative columns called year and temp, and that they are “delimited” by a minus sign.

Extra: this doesn’t work, showing that you need to be careful:

aatemp2 <- read_delim(my_url)
Error: Could not guess the delimiter.

Use `vroom(delim =)` to specify one explicitly.

Sometimes read_delim can guess what the data values are separated by, but here it couldn’t. As far as it is concerned, the temperatures might have been negative numbers, and then there would have been nothing between them and the years, not even a space.

In the error message, vroom stands for “whatever read_ thing you are using”.

(c) (3) Make a suitable graph of the two variables in your dataframe.

The two variables are both quantitative, so a scatterplot. On a plot against time, it is standard practice for time to be on the \(x\) axis. I think it is best to join the points by lines so that you can see any time trend more easily:

ggplot(aatemp, aes(x  = year, y = temp)) + geom_point() + geom_line()

but I am also happy without the geom_line:

ggplot(aatemp, aes(x  = year, y = temp)) + geom_point() 

3 Learning to walk

Learning to walk is a major milestone for children, and a major headache for their parents! Most children learn to walk at an age around 12 months. The data in a spreadsheet at http://ritsokiguess.site/datafiles/walking.xls are the results of a study carried out on children long before they learned to walk. There were four treatments, labelled t1 through t4. The workbook has two sheets, one with the names of the four treatments, and a second sheet with the actual data. Our aim is to read the actual data into an R dataframe.

(a) (1) Run the two commands shown in the code chunk below. This will download a copy of the spreadsheet and save it on the machine where you are currently running R.

my_url <- "http://ritsokiguess.site/datafiles/walking.xls"
download.file(my_url, "walking.xls")

As instructed (copy and paste the code):

my_url <- "http://ritsokiguess.site/datafiles/walking.xls"
download.file(my_url, "walking.xls")

Or the other way with mode = "xb". You may see a message saying how big the file is and showing you that it did indeed download.

The grader won’t be able to see that you have done this and it worked, but will give you credit for it if the following parts of the question are correct (on the grounds that you must have done this part of the question correctly). That means that if you got stuck and found another way to get the spreadsheet to the computer on which you are running R, that is also good. (In that case, though, you would do well to say what you did.)

I can show you that the file is indeed in the same folder that I am working:

file.info("walking.xls")

This tells you that the file exists (otherwise you would get missing values everywhere), that it is 11776 bytes in size, is a file rather than a folder (“isdir” stands for “is a directory”), was last modified by me at the date and time under mtime, and so on.

Extra: if you were wondering what the mode = "xb" thing was all about, and why it was specific to Windows, well, the short answer is in the middle of a lot of detail in the help for download.file.6 At the bottom of the Details section it says:

Code written to download binary files must use mode = “wb” (or “ab”), but the problems incurred by a text transfer will only be seen on Windows.

So, without using mode = "wb", I actually got away with it for myself (because I run Linux). I should have included it for myself as well, but any of you running R on your own Windows machine will run into problems.7 The problem is the distinction between a “text file” and a “binary file”. Our data files (apart from this one) are text files; you can open them in Notepad and see that they are just letters and numbers, ordinary text. A spreadsheet is not a text file because it has formatting information (that is hidden in the file) and what you see is not all there is (example: spreadsheet calculations, where what you see immediately is the result, not the formula, unless you go look at it). Things like Word files are also binary files; even though they contain ordinary text, they also contain hidden things like font and text size. One of the key markers of a binary file is that it only works in specific software (like Word or Excel). You could open one of our data files in Word, even, and it will still look all right, but you don’t have to open it in Word (unlike a Word .doc where you really have no choice).

Those two characters in mode = "wb" mean this:

  • w means “write”: when you download the file, make a new file, overwriting any file by the same name that was there before. (There is also a which means “append”: add it onto the end of whatever was in the file before.)
  • b means “binary”, thus giving download.file an explicit heads-up that this is a binary file rather than a text file. Thus for this file, adding the mode = "wb" is safe no matter what operating system R is running on.

One of the good things about .csv files is that they are text files: they really do contain only letters, numbers, and other ordinary symbols. Hence read_csv can work with them whether they are on the computer that R is running on, or on a website somewhere. Thus, in the context of what we have just been discussing, saving as .csv is a way of turning a binary file (a spreadsheet) into a text file (the .csv), and text files are easier to download from websites as far as R is concerned.

(b) (3) Open the spreadsheet in Excel (or whatever you have). What is the name of the sheet that has the actual data in it? Hence, read the data from the spreadsheet you downloaded into an R dataframe and display some of it.

It’s called “data display”, and is the second sheet.

First, make sure you have readxl loaded:

library(readxl)

and then, making sure to say which sheet you want:

walking <- read_excel("walking.xls", sheet = "data display")
walking

You can also refer to the sheet by number:

walking2 <- read_excel("walking.xls", sheet = 2)
walking2

This is indeed (either way) the sheet with the actual data in it (rather than the other sheet with the treatment names in it).

(c) (2) Make a suitable boxplot of the data.

This means having treatment as x and age of walking as y. (I told you what kind of plot to draw this time.)

ggplot(walking, aes(x = treatment, y = age_walk)) + geom_boxplot()

Extra 1: It looks as if treatment 1, whatever it is, goes with the children learning to walk quickest (best). You might think of running an ANOVA on these data, but the high outliers on treatments 1 and 2 (children who learned to walk much later than the other children on the same treatment) ought to make you wonder whether that is a good idea.

It turns out that treatment 1 is a real treatment, whereby the children “had their walking and placing reflexes trained during four three-minute sessions that took place every day from their second to their eighth week of life.” Treatment 2 was a sort of fake version of treatment 1, and treatments 3 and 4 were controls, which makes it not at all surprising that the boxplot came out as it did.

Extra 2: if the treatments are numbered, the boxplot doesn’t come out (and I didn’t want you grappling with that on assignment 1), so I labelled the treatments t1 through t4 so that the variable treatment was obviously categorical and not possibly quantitative (as it might appear if the treatments were numbered 1 through 4).

Footnotes

  1. If you look at the file by opening the URL in your web browser, it will either open in your spreadsheet program, or download for you to do so. But it’s actually a CSV, and you don’t need to do any more than I did to read it in.↩︎

  2. Day and time look like whole numbers, but the read_ things generally read numbers as if they might have decimal places.↩︎

  3. This works on any plot, not just boxplots.↩︎

  4. I explain the alpha thing below.↩︎

  5. Investigate what happens when you leave out alpha, or use a value near 0.↩︎

  6. Which you can access by going down to the Console and typing ?download.file. The same procedure gets you the help file for anything in R. These files have a standard layout, which you have to get used to reading.↩︎

  7. r.datatools runs on a Linux machine, so you should have had no problems there.↩︎