STAC33 Assignment 1

You are expected to complete this assignment on your own: that is, you may discuss general ideas with others, but the writeup of the work must be entirely your own. If your assignment is unreasonably similar to that of another student, you can expect to be asked to explain yourself.

If you run into problems on this assignment, it is up to you to figure out what to do. The only exception is if it is impossible for you to complete this assignment, for example a data file cannot be read. (There is a difference between you not knowing how to do something, which you have to figure out, and something being impossible, which you are allowed to contact me about.)

You must hand in a rendered document that shows your code, the output that the code produces, and your answers to the questions. This should be a file with .html on the end of its name. There is no credit for handing in your unrendered document (ending in .qmd), because the grader cannot then see whether the code in it runs properly. After you have handed in your file, you should be able to see (in Attempts) what file you handed in, and you should make a habit of checking that you did indeed hand in what you intended to, and that it displays as you expect.

Hint: render your document frequently, and solve any problems as they come up, rather than trying to do so at the end (when you may be close to the due date). If your document will not successfully render, it is because of an error in your code that you will have to find and fix. The error message will tell you where the problem is, but it is up to you to sort out what the problem is.

1 Reading in data from files

(a) (1 point) Using your web browser, take a look at the data in http://ritsokiguess.site/datafiles/ex10.09.txt. What separates one data value from the next?

There are (note for yourself) two columns, one called type which is some text (it looks like the type of grain) and one called thiamin that is a number and looks like a measurement.

They are separated by exactly one space. (That is, the thiamin values do not line up because some of the grain types have longer names than others.)

(b) (3 points) Read the data from http://ritsokiguess.site/datafiles/ex10.09.txt into a dataframe called grain, using something that you learned in this course, and demonstrate that it was read in correctly.

The data values are, as you observed, separated by exactly one space, so that read_delim, with a single space as delimiter, will read them in:

my_url <- "http://ritsokiguess.site/datafiles/ex10.09.txt"
grain <- read_delim(my_url, " ")
Rows: 24 Columns: 2
── Column specification ────────────────────────────────────────────────────────
Delimiter: " "
chr (1): type
dbl (1): thiamin

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
grain

My standard operating procedure here is this:

  • Save the (usually long) URL into something that I usually call my_url. If you hover over the URL in the question file and right-click on it, you should get a menu that includes “copy link”. Do that, and then paste it into your code.
  • Read the data from the file, in whatever format it is. This means using the right read_ function, along with any extra inputs it needs. Give the dataframe a name that says what it contains (in this case I told you to use the name grain).
  • Display the dataframe, to make sure that I read the right thing (you can scroll through a few rows to get the idea) and that it read in properly.

I do things like this for a couple of reasons:

  • reading data files is something you will be doing all the time, so it is well worth setting up a streamlined process so that you can do it efficiently and without thinking too much, every time you do it.
  • as a reminder to give the dataframe a good name
  • as a reminder to take a look at what I read in from the file, before trying to do anything else with it (like drawing a graph or running some kind of analysis, which will fail if I didn’t read the data in properly).
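The three steps above can be sketched on a small local file, so that the example runs without a network connection. The file contents below are made up for illustration (they are not the real thiamin values), and the name grain_demo is hypothetical. glimpse is one extra way, on top of displaying the dataframe, to check the column names and types.

```r
library(readr)
library(dplyr)

# Made-up values laid out like the real file: exactly one space between values.
# (Assumption: these numbers are for illustration only.)
my_file <- tempfile(fileext = ".txt")
writeLines(c("type thiamin", "wheat 5.2", "barley 6.5", "maize 5.8"), my_file)

# Step 1 was saving the location (here a temporary file stands in for the URL).
# Step 2: read the data with the right read_ function.
grain_demo <- read_delim(my_file, delim = " ")

# Step 3: display the dataframe, and check the column names and types.
grain_demo
glimpse(grain_demo)
```

If the read went wrong, this is where you would expect to see it: for example, one column instead of two, or thiamin showing up as text rather than as a number.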

Extra 1: If you have used R in another course, you might have used something like read.table to read in data like this. In this course, that is wrong. All our data files are read in with a function that starts with read and is followed by an underscore. What I want to see is that you have learned to do things as I do them. Aren’t you in this course because you want to learn from me?

Extra 2: what happens if you leave out delim? As ever in this course, try it and see:

grain2 <- read_delim(my_url)
Rows: 24 Columns: 2
── Column specification ────────────────────────────────────────────────────────
Delimiter: " "
chr (1): type
dbl (1): thiamin

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
grain2

It seems to work. If you do it this way in the work you hand in, though, you need to find out and explain why it works (otherwise it looks as if you were just guessing). The best explanation I found is this one:

delim: One or more characters used to delimit fields within a file. If NULL the delimiter is guessed from the set of c(",", "\t", " ", "|", ":", ";").

This is from the documentation of vroom in the package of the same name, which read_delim uses behind the scenes. What it says is that if you don’t specify the delimiter, read_delim goes looking through the data file to see whether what was used seems to be one of comma, tab, space, vertical line, colon, semicolon, and in this case it succeeds because the data values were separated by single spaces.
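One way to convince yourself of this is to read the same file twice, once with the delimiter stated and once with it left out, and compare the results. This is only a sketch: the file is a temporary one with made-up values, and the names explicit and guessed are hypothetical.

```r
library(readr)

# Made-up data with exactly one space between values (illustration only)
demo_file <- tempfile(fileext = ".txt")
writeLines(c("type thiamin", "wheat 5.2", "barley 6.5", "maize 5.8"), demo_file)

# delim stated explicitly
explicit <- read_delim(demo_file, delim = " ", show_col_types = FALSE)

# delim left out, so that it has to be guessed from the file
guessed <- read_delim(demo_file, show_col_types = FALSE)

# compare the two versions column by column
identical(explicit$type, guessed$type)
identical(explicit$thiamin, guessed$thiamin)
```

Comparing the columns shows that the guess worked on this file; it does not replace the explanation of why it worked.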

(c) (2 points) Using your web browser, take a look at the data in http://ritsokiguess.site/datafiles/ex10.09a.txt. How is this different from the data of (a)?

This time, the columns are lined up (aligned): all of the grain types are vertically below the word type and all of the thiamin values are vertically below the word thiamin (with, in fact, the decimal points aligned).

The consequence of this is that the data values are separated by a variable number of spaces, not exactly one every time as we had before. This will impact how the data are read in.

(d) (3 points) Read the data from http://ritsokiguess.site/datafiles/ex10.09a.txt into a dataframe with a suitable name, using something that you learned in this course, and demonstrate that it was read in correctly.

A sensible name would be one like we had before, but slightly different, to reflect that the data are laid out differently, such as grain2. Since the data are no longer separated by a fixed number of spaces, we need to use read_table rather than read_delim:

my_url <- "http://ritsokiguess.site/datafiles/ex10.09a.txt"
grain2 <- read_table(my_url)

── Column specification ────────────────────────────────────────────────────────
cols(
  type = col_character(),
  thiamin = col_double()
)
grain2

This worked. I actually copied and pasted my code from (b), and carefully changed the names that needed changing, including taking out the delimiter " " that we had in read_delim. (read_table uses any amount of whitespace to separate data values: whitespace means spaces, tabs, and even newlines. It handles a lot of other things apart from aligned columns, but it does not handle variable numbers of delimiters that are not whitespace.)
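A small sketch of the same idea on a local file: the values below are made up and lined up in columns, so the number of spaces between them varies from row to row. The name aligned is hypothetical.

```r
library(readr)

# Made-up values, aligned in columns (illustration only):
# the number of spaces between type and thiamin differs by row.
aligned_file <- tempfile(fileext = ".txt")
writeLines(c(
  "type    thiamin",
  "wheat       5.2",
  "barley      6.5",
  "maize       5.8"
), aligned_file)

# read_table treats any run of whitespace as one separator
aligned <- read_table(aligned_file)
aligned
```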

Extra: if you are thinking logically, you might realize that a technique that reads in data separated by at least one space will also read in data separated by exactly one space, and thus would have worked in (b) as well. Is it correct to think this? Try it and see:

my_url <- "http://ritsokiguess.site/datafiles/ex10.09.txt"
grain1a <- read_table(my_url)

── Column specification ────────────────────────────────────────────────────────
cols(
  type = col_character(),
  thiamin = col_double()
)
grain1a

It works, but as with using read_delim without an explicitly-stated delim, you need to explain why it works. In this course, you don’t get points for guessing; there are often different acceptable ways to do things, but you need to get used to explaining why your method will work, particularly if what you have done is not what the grader is expecting (which is generally the way that appears in the lecture notes).

(e) (3 points) Some more realistic data, in a .csv file, is in http://ritsokiguess.site/datafiles/choccake.csv. These are the results of an experiment of baking chocolate cake using different recipes, different batches of batter, and different baking temperatures. The outcome variable, breakang in the last column, is the “breaking angle”; a higher value is better. Read in and display some of the data.

This is a .csv file, so we don’t even need to look at it to see what it is. (If you want to, you can, but you will probably find that it gets downloaded and opened in Excel or whatever spreadsheet software you have.)

Hence, read_csv is something you can go straight to (see footnote 1). This needs a filename (or a URL) only, thus:

my_url <- "http://ritsokiguess.site/datafiles/choccake.csv"
cake <- read_csv(my_url)
Rows: 270 Columns: 4
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (3): recipe, batch, temp
dbl (1): breakang

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
cake

There are 270 observations. cake, or anything else that describes the data without duplicating the name of any of the columns, is a good name for the dataframe.

In this course, read.csv with a dot is wrong, for the same reasons that read.table is wrong in this course. (I didn’t teach you either.)
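The message that read_csv prints mentions two ways to quiet it: set show_col_types = FALSE, or specify the column types yourself. Neither is needed for this assignment; the following is only a sketch of what each looks like. It uses a temporary file with made-up rows that borrow the column names of choccake.csv, and the names cake_quiet and cake_typed are hypothetical.

```r
library(readr)

# Made-up rows using the same column names as choccake.csv (illustration only)
cake_file <- tempfile(fileext = ".csv")
writeLines(c(
  "recipe,batch,temp,breakang",
  "A,b1,T175,42",
  "A,b1,T185,46",
  "B,b2,T175,39"
), cake_file)

# Option 1: keep the guessed column types, but silence the message
cake_quiet <- read_csv(cake_file, show_col_types = FALSE)

# Option 2: state the column types yourself
cake_typed <- read_csv(
  cake_file,
  col_types = cols(
    recipe = col_character(),
    batch = col_character(),
    temp = col_character(),
    breakang = col_double()
  )
)
cake_typed
```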

Extra: my source says this about the breaking angle:

Several measurements were made on the cakes. The one shown here is the breaking angle. One half of a slab of cake is held fixed, while the other half is pivoted about the middle until breakage occurs. The angle through which the moving half has revolved is read on a circular scale. Since breakage is gradual, the reading tends to have a subjective element. (A higher breaking angle is considered better.)

I think this is a measure of how “crumbly” the cake is when it is sliced. If you are serving slices of the cake to your guests, you don’t want the slices falling to pieces while your guests are trying to eat them.

(f) (2 points) Make a histogram of the breaking angle values, using 8 bins. (Hint: worksheet 1.)

We haven’t really started graphs yet, but I wanted to give you something a bit more interesting than just reading data in. Hence, I thought I would borrow something you have already seen from worksheet 1:

ggplot(cake, aes(x = breakang)) + geom_histogram(bins = 8)

(g) (3 points) Using the same data and same variable, make a histogram using 20 bins. Which histogram do you prefer, and why? Explain briefly.

For the code, copy and paste and change the number of bins:

ggplot(cake, aes(x = breakang)) + geom_histogram(bins = 20)

Only one point for that, since it is literally copy, paste, and (small) edit.

Our purpose in drawing a histogram is to learn something about the shape of the distribution of breaking angles. (Maybe something about centre and spread, but principally shape.) So a histogram that gives a smoother picture of shape is one that we should prefer. The one with 8 bins (in the previous part) shows pretty clearly how the breaking angle distribution increases quickly to a peak (around 30 degrees) and has a long right tail, so is skewed to the right.

The second histogram, with 20 bins, is much more uneven; the tall peak and the long tail are there, but much less easy to see, and the bars on the histogram jiggle up and down instead of showing a nice smooth trend. So the story that the histogram is telling you here is much less clear.

(Occasionally, but much less often, you really do want to see the details of the shape, and in that case, you might want to use a large number of bins like 20. But this is rare, and you need to have a specific reason for wanting this level of detail.)
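To experiment with the effect of the number of bins without the cake data, here is a sketch that uses simulated right-skewed values. The data frame sim and its column angle are made up (they are not the breaking angles), and the seed is arbitrary.

```r
library(ggplot2)

# Simulated right-skewed values, standing in for the breaking angles
# (assumption: made-up data, used only to compare bin counts)
set.seed(33)
sim <- data.frame(angle = 25 + rgamma(270, shape = 2, scale = 5))

p8 <- ggplot(sim, aes(x = angle)) + geom_histogram(bins = 8)
p20 <- ggplot(sim, aes(x = angle)) + geom_histogram(bins = 20)

# display the two histograms
p8
p20

# each row of the layer data corresponds to one bar of the histogram
nrow(layer_data(p8))
nrow(layer_data(p20))
```

Looking at the two plots one after the other gives the same kind of comparison as above: with more bins, the bar heights are based on fewer observations each, and so they vary more from one bar to the next.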

Extra: This is a rather simple-minded plot, but I didn’t want to make things too complicated for you yet. In a dataset like this, the point was, for example, to see whether a certain baking temperature was associated with the best (highest) breaking angle. That means adding temp, a categorical variable (see footnote 2), to our graph. One quantitative variable and one categorical one means a boxplot, thus:

ggplot(cake, aes(x = temp, y = breakang)) + geom_boxplot()

The median breaking angle seems to increase with temperature, maybe until the highest temperature, at which point it seems to level off. All of the distributions appear to be right-skewed, so the long tail we saw on the histogram is not simply a result of mixing together breaking angles from a number of different baking temperatures; each of the distributions really is like that.
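The medians that you read off a boxplot can also be worked out as numbers, one per group. This is a sketch on a tiny made-up dataframe: cake_demo, its values, and the temperature labels low and high are all hypothetical, chosen so that the medians are easy to check by hand.

```r
library(dplyr)

# Made-up breaking angles for two made-up temperature groups (illustration only)
cake_demo <- tibble(
  temp = c("low", "low", "low", "high", "high", "high", "high"),
  breakang = c(20, 30, 40, 25, 35, 45, 55)
)

# one median breaking angle for each temperature
medians <- cake_demo %>%
  group_by(temp) %>%
  summarize(med_angle = median(breakang))
medians
```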

Footnotes

  1. Most of the reading of data files that you will do after this assignment will be of .csv files, and the principle is that if it looks like a .csv file and read_csv reads it in properly, then you are good.

  2. Temperature is actually quantitative, but in a designed experiment like this one with only a few different temperatures, it is usual to treat it as categorical.