Worksheet 1

Published

December 22, 2023

The questions here have solutions attached. Follow the solutions to see what to do, if you cannot otherwise guess.

It is worth your while to work through these problems, and the ones in future tutorial worksheets, because they will get you used to how R operates and give you some comfort in coding simple things. If you do not work through these problems now, any issues that you could have dealt with this week (with help available) will come back to bite you when you have an assignment due. This is stress you would do well to be without.

If you don’t get to the end in tutorial, it’s a good idea to finish the problems on your own time this week.

1 Using R Studio online

(a) Point your web browser at http://r.datatools.utoronto.ca. Click on the button to the left of “R Studio” (it will show blue), click the orange Log in to Start, and log in using your UTorID and password. (You might see a “CILogon” screen first. If you do, make sure that it says University of Toronto near the bottom and click the green Log On.)

This is about what you should see first, before you click the orange thing:

You will see a progress bar as things start up, and then you should see something like this:

This is R Studio, ready to go.

If you are already logged in to something else on the same browser that uses your UTorID and password, you may come straight here without needing to log in again.

(b) Create a new Project for your work in this course. A good name for the project is the course code. See below for how to do this.

Select File and New Project to get this:

Click on New Directory (highlighted blue on mine). This will create a new folder to put your new project in, which is usually what you want to do. The idea is that a project is a container for a larger collection of work, such as all your assignments in this course. That brings you to this:

where you click on New Project (highlighted on mine), and:

Give your project a name, as I did. Then click Create Project. At this point, R Studio will be restarted in your new project. You can tell which project you are in by looking top right, and you’ll see the name of your project next to the R symbol:

(c) One last piece of testing: find the Console window (which is probably on the left). Click next to the blue >, and type library(tidyverse). Press Enter.

It may think a bit, and then you’ll see something like this:

You may have noticed (especially if you pause while typing) that R Studio will offer you some suggestions. If what you want is listed among the suggestions, you can select it by using the up and down arrow if needed, then press Enter to complete what you are typing.

Aside: I used to use a cloud R Studio called rstudio.cloud. If you see or hear any references to that, it means the same thing as R Studio on r.datatools or jupyter. (You can still use rstudio.cloud if you want; it used to be completely free, but now the free tier won’t last you very long; the utoronto.ca link is free as long as you are at U of T.) I’m trying to get rid of references to R Studio Cloud as I see them, but I am bound to miss some, and in the lecture videos they are rather hard to find.

Now we can get down to some actual work.

2 Getting started

This question is to get you started using R.

(a) Start R Studio on r.datatools (or on your computer), in your course project that you created in the previous question.

If you just finished the previous question, you have nothing to do. If you shut down R Studio in between, start it up again. It often likes to put you back in the last project you were in (in which case you have nothing more to do), but if it doesn’t, look in the File menu and either Open Project or see if you can find your course project in Recent Projects.

You ought to see something like this:

There should be one thing on the left half, and at the top right it’ll say “Environment is empty”.

Extra: if you want to tweak things, select Tools (at the top of the screen) and from it Global Options, then click Appearance. You can make the text bigger or smaller via Editor Font Size, and choose a different colour scheme by picking one of the Editor Themes (which previews on the right). My favourite is Tomorrow Night Blue. Click Apply or OK when you have found something you like. (I spend a lot of time in R Studio, and I like having a dark background to be easier on my eyes, but I use a black-on-white theme in lecture so that you can see most easily what I am doing.)

(b) We’re going to do some stuff in R here, just to get used to it. First, make a Quarto document by selecting File, New File and Quarto Document.

In the first box that pops up, you’ll be invited to give your document a title. Make something up for now. Down at the bottom, it says “Use Markdown Visual Editor” with a box to the left. Check that box.

The first time, you might be invited to “install some packages” to make the document thing work.1 Let it do that by clicking Yes. After that, you’ll have this:

A couple of technical notes:

  • this should be in the top left pane of your R Studio now, with the Console below it.

  • At the top of the file, between the two lines with three hyphens (minus signs, whatever), is some information about the document, known in the jargon as a YAML block, any of which you can change:

    • the title is whatever title you gave your document

    • the format is what the output is going to be (in this case, HTML like a webpage, which is mostly what we’ll be using)

  • You should be in a visual editor that looks like Notion or a bit like a Google doc (the default). There is also a Source editor which gives you more control, and shows that underlying the document is a thing called R Markdown (which is a code for writing documents). Switch between them by clicking on Source or Visual as appropriate at the top left of your document.

  • My document is called “My awesome title”, but the file in which the document lives is still untitled because I haven’t saved it yet. See right at the top.

(c) You can delete the template code below the YAML block now (that is, everything from the title “Quarto” to the end). Somewhere in the space opened up below the YAML block (it might say “Heading 2”, greyed out), type a /. This, like Notion, gives you a list of things to choose from to insert there. Pressing Enter will insert a “code chunk”, sometimes known as a “code cell”. We are going to use this in a moment.

Something like this:

The {r} at the top of the code chunk means that the code that will go in there will be R code (you can also have a Python code chunk, among others).

(d) On the line below the {r}, type these two lines of code into the chunk in the Quarto document:

library(tidyverse)
mtcars

What this will do: get hold of a built-in data set with information about some different models of car, and display it.

In approximately five seconds, you’ll be demonstrating that for yourself.

(e) Run this command. To do that, look at the top right of your code chunk block (shaded in a slightly different colour). You should see a down arrow and a green “play button”. Click the play button. This will run the code, and show the output below the code chunk.

Here’s what I get (yours should be the same):

This is a rectangular array of rows and columns, with individuals (here, cars) in rows and variables in columns, known as a “dataframe”. When you display a dataframe in a Quarto document, you see 10 rows and as many columns as will fit on the screen. At the bottom, it says how many rows and columns there are altogether (here 32 rows and 11 columns), and which ones are being displayed.

You can see more rows by clicking on Next, and if there are more columns, as there are here, you’ll see a little arrow next to the rightmost column (as here next to am) that you can click on to see more columns. Try it and see. Or if you want to go to a particular collection of rows, click one of the numbers between Previous and Next: 1 is rows 1–10, 2 is rows 11–20, and so on.

The column on the left without a header (containing the names of the cars) is called “row names”. These have a funny kind of status, kind of a column and kind of not a column; usually, if we need to use the names, we have to put them in a column first.
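To make the remark about row names concrete, here is a minimal sketch of putting the car names into a real column. It assumes the tidyverse is loaded; rownames_to_column comes from the tibble package (part of the tidyverse), and the names car and cars_named are my own choices for illustration:

```r
library(tidyverse)

# move the row names into an ordinary column called "car" (a name chosen for illustration)
cars_named <- mtcars %>% rownames_to_column("car")

# the car names can now be used like any other column
cars_named %>% select(car, mpg, hp)
```

After this, cars_named has one more column than mtcars, and the names are available to anything that works on columns.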

In future solutions, rather than showing you a screenshot, expect me to show you something like this:

library(tidyverse)
mtcars

The top bit is the code, the bottom bit the output. In this kind of display, you only see the first ten rows (by default).2

If you don’t see the “play button”, make sure that what you have really is a code chunk. If you can’t figure it out, delete this code chunk and make a new one. Sometimes R Studio gets confused.

There are two other symbols on the code chunk. The first one is the settings for this chunk (you have the choice to display or not display the code or the output, or to not actually run the code). The second one, the down arrow, runs all the chunks prior to this one (but not this one).

Your output has its own little buttons (as seen on the screenshot). The first one pops the output out into its own window; the second one shows or hides the output, and the third one deletes the output (so that you have to run the chunk again to get it back). Experiment. You can’t do much damage here.

(f) Something a little more interesting: summary obtains a summary of whatever you feed it (the five-number summary plus the mean for numerical variables). Obtain this for our data frame. To do this, create a new code chunk below the previous one, type summary(mtcars) into the code chunk, and run it.

This is what you should see:

or the other way:

summary(mtcars)
      mpg             cyl             disp             hp       
 Min.   :10.40   Min.   :4.000   Min.   : 71.1   Min.   : 52.0  
 1st Qu.:15.43   1st Qu.:4.000   1st Qu.:120.8   1st Qu.: 96.5  
 Median :19.20   Median :6.000   Median :196.3   Median :123.0  
 Mean   :20.09   Mean   :6.188   Mean   :230.7   Mean   :146.7  
 3rd Qu.:22.80   3rd Qu.:8.000   3rd Qu.:326.0   3rd Qu.:180.0  
 Max.   :33.90   Max.   :8.000   Max.   :472.0   Max.   :335.0  
      drat             wt             qsec             vs        
 Min.   :2.760   Min.   :1.513   Min.   :14.50   Min.   :0.0000  
 1st Qu.:3.080   1st Qu.:2.581   1st Qu.:16.89   1st Qu.:0.0000  
 Median :3.695   Median :3.325   Median :17.71   Median :0.0000  
 Mean   :3.597   Mean   :3.217   Mean   :17.85   Mean   :0.4375  
 3rd Qu.:3.920   3rd Qu.:3.610   3rd Qu.:18.90   3rd Qu.:1.0000  
 Max.   :4.930   Max.   :5.424   Max.   :22.90   Max.   :1.0000  
       am              gear            carb      
 Min.   :0.0000   Min.   :3.000   Min.   :1.000  
 1st Qu.:0.0000   1st Qu.:3.000   1st Qu.:2.000  
 Median :0.0000   Median :4.000   Median :2.000  
 Mean   :0.4062   Mean   :3.688   Mean   :2.812  
 3rd Qu.:1.0000   3rd Qu.:4.000   3rd Qu.:4.000  
 Max.   :1.0000   Max.   :5.000   Max.   :8.000  

For the gas mileage column mpg, the mean is bigger than the median, and the largest value is unusually large compared with the others, suggesting a distribution that is skewed to the right.

There are 11 numeric (quantitative) variables, so we get the five-number summary plus mean for each one. Categorical variables, if we had any here, would be displayed a different way.
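As a hedged illustration of that last point: cyl is stored as a number, but it takes only a few distinct values, so it could be treated as categorical. The sketch below converts it with factor (base R) and shows that summary then gives counts rather than a five-number summary. The name cyl_cat is made up for this example:

```r
# summary of a numeric column: five-number summary plus the mean
summary(mtcars$mpg)

# treat the number of cylinders as categorical (cyl_cat is a name made up here)
cyl_cat <- factor(mtcars$cyl)

# summary of a factor: how many cars fall in each category
summary(cyl_cat)
```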

(g) Let’s make a histogram of the gas mileage data. Type the code below into another new code chunk, and run it:

ggplot(mtcars, aes(x = mpg)) + geom_histogram(bins = 8)

The code looks a bit wordy, but we’ll see what all those pieces do later in the course (like, maybe tomorrow).

Solution

This is what you should see:

ggplot(mtcars, aes(x = mpg)) + geom_histogram(bins = 8)

The long right tail supports our guess from before that the distribution is right-skewed.

(h) Some aesthetics: Add some narrative text above and below your code chunks. Above the code chunk is where you say what you are going to do (and maybe why you are doing it), and below is where you say what you conclude from the output you just obtained. I find it looks better if you have a blank line above and below each code chunk.

This is what I wrote (screenshot), with none of the code run yet. My library(tidyverse) line seems to have disappeared, but yours should still be there:

(i) Save your Quarto document (the usual way with File and Save). This saves it on the jupyter servers (and not on your computer). This means that when you come back to it later, even from another device, this notebook will still be available to you. (This of course does not apply if you are running R Studio on your own computer.) Now click Render. This produces a pretty HTML version of your Quarto document. This will appear in a new tab of your web browser,3 which you might need to encourage to appear (if you have a pop-up blocker) by clicking Try Again.

If there are any errors in the rendering process, these will appear in the Render tab. The error message will tell you where in your document your error was. Find it and correct it.4 Otherwise, you should see your document.

(j) The rendering process as you did it doesn’t produce that nice display of a dataframe that I had in one of my screenshots. To get that, alter the YAML block to read as below. Re-render, and note what it does.

format: 
  html:
     df-print: paged
     embed-resources: true

Note now that anyone reading your document can actually page through the dataframes you display in the same way that you did, to check that they contain the right things.

You should have this in the YAML block at the top of each assignment you do, so that the grader can check that your dataframes are what you say they are, and for another reason that I explain below.

Extra 1: you might prefer to have a preview of your document within R Studio. To make this happen if it doesn’t by itself, look for the gear wheel to the right of Render. Click the arrow beside it, and in the drop-down, click on Preview in Viewer Pane. Render again, and you’ll see the rendered version of your document in a Viewer pane on the right. This puts the thing you’re writing and what it will look like side by side.

Extra 2: you might be annoyed at having to remember to save things. If you are, you can enable auto-saving. To do this, go to Tools and select Global Options. Select Code (on the left) and Saving (at the top). Click on Automatically Save when editor loses focus, to put a check mark in the box on the left of it. Change the pull-down below that to Save and Write Changes. Click OK. Now, as soon as you pause for a couple of seconds, everything unsaved will be saved. This does not apply to “untitled” documents, since R Studio has no idea where to save them until you give the file a name. (This is a good reason to give your files names and save them once as soon as possible.)

(k) Practice handing in your rendered Quarto document, as if it were an assignment that was worth something. (It is good to get the practice in a low-stakes situation, so that you’ll know what to do next week.)

See below for what to do if you are running R Studio on your computer, rather than on r.datatools.

There are two steps: download the HTML file onto your computer, and then hand it in on Quercus. To download: find the HTML file that you want to download in the Files pane on the right. You might need to click on Files at the top, especially if you had a Viewer open there before:

I called my Quarto document awesome, and the file I was working on was called awesome.qmd (the extension stands for “Quarto Markdown”). That’s the file I had to render to produce the output. My output file itself is called awesome.html. That’s the file I want to hand in. If you called your file something different when you saved it, that’s the thing to look for: there should be something ending in .qmd and something with the same first part ending in .html.

Click the checkbox to the left of the HTML file. Now click on More above the bottom-right pane. This pops up a menu from which you choose Export. This will pop up another window called Export Files, where you put the name that the file will have on your computer. (I usually leave the name the same.) Click Download. The file will go to your Downloads folder, or wherever things you download off the web go.

If you are working on your computer, you will have created a new folder when you created your project, and the html file you made by rendering will be in that folder. Thus, you can hand it in (below) directly without having to download it first.

Now, to hand it in. Open up Quercus at q.utoronto.ca, log in and navigate to this course. Click Assignments. Click (the title of) Assignment 0. There is a big blue Start Assignment button top right. Click it. You’ll get a File Upload at the bottom of the screen. Click Choose File and find the HTML file that you downloaded. Click Open (or equivalent on your system). The name of the file should appear next to Choose File. Click Submit Assignment. You’ll see Submitted at the top right, and below that is a Submission Details window and the file you uploaded.

You should be in the habit of always checking what you hand in, by downloading it again and looking at it to make sure it’s what you thought you had handed in. One reason for doing so is that if you don’t have the line

     embed-resources: true

in your YAML header, you will find that the file you handed in has lost all its graphs,5 which is of course a big problem on an assignment.

If you want to try this again, you can try again as many times as you like, by making a New Attempt. (For the real thing, you can use this if you realize you made a mistake in something you submitted. The graders’ instructions, for the real thing, are to grade the last file submitted, so in that case you need to make sure that the last thing submitted before the due date includes everything that you want graded. My assignments have unlimited attempts, so you don’t have to ask me for another one.)

(l) Something more ambitious: make a scatterplot of gas mileage mpg, on the \(y\)-axis, against horsepower, hp, on the \(x\)-axis.

That goes like this. I’ll explain the steps below.

library(tidyverse)
ggplot(mtcars, aes(x=hp, y=mpg)) + geom_point()

This shows a somewhat downward trend, which is what you’d expect, since a larger hp value means a more powerful engine, which will probably consume more gas and get fewer miles per gallon. As for the code: to make a ggplot plot, as we will shortly see in class, you first need a ggplot statement that says what to plot. The first thing in a ggplot is a data frame (mtcars here), and then the aes says that the plot will have hp on the \(x\)-axis and mpg on the \(y\)-axis, taken from the data frame that you specified. That’s all of the what-to-plot. The last thing is how to plot it; geom_point() says to plot the data values as points.
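One way to see the split between what-to-plot and how-to-plot is to build the graph in two steps. This is only a sketch: the variable name p is my own, and saving a plot in a variable is not something you need to do in this worksheet. The last line adds a numerical check of the trend using cor (base R), which is an extra and not part of the question:

```r
library(tidyverse)

# what to plot: the data frame, and which columns go on which axis
p <- ggplot(mtcars, aes(x = hp, y = mpg))

# how to plot it: add points to the (so far empty) plot
p + geom_point()

# a numerical check of the downward trend: a negative correlation goes with a downhill pattern
cor(mtcars$hp, mtcars$mpg)
```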

You might like to add a regression line to the plot. That is a matter of adding this to the end of the plotting command:

ggplot(mtcars, aes(x=hp, y=mpg)) + geom_point() + geom_smooth(method="lm")
`geom_smooth()` using formula = 'y ~ x'

The line definitely goes downhill. Decide for yourself how well you think a line fits these data.
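If you are curious which line geom_smooth(method="lm") is drawing, it is the least-squares regression line, which can also be fitted directly with lm. A minimal sketch, with mpg_fit a name made up for this example:

```r
# fit the straight line predicting mpg from hp (mpg_fit is a hypothetical name)
mpg_fit <- lm(mpg ~ hp, data = mtcars)

# intercept and slope of the fitted line; a negative slope means the line goes downhill
coef(mpg_fit)
```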

3 Reading data from a file

In this question, we read a file from the web and do some descriptive statistics and a graph. This is very like what you will be doing on future assignments, so it’s good to practice it now.

Take a look at the data file at http://ritsokiguess.site/datafiles/jumping.txt. These are measurements on 30 rats that were randomly assigned to groups that did different amounts of jumping (we’ll see the details later in the course). The control group did no jumping, and the other groups did “low jumping” and “high jumping”. The first column says which jumping group each rat was in, and the second is the rat’s bone density (the experimenters’ supposition was that more jumping should go with higher bone density).

(a) What are the two columns of data separated by? (The fancy word is “delimited”.)

Exactly one space. This is true all the way down, as you can check.

(b) Make a new Quarto document. Leave the YAML block, but get rid of the rest of the template document. Start with a code chunk containing library(tidyverse). Run it.

You will get either the same message as before or nothing. (I got nothing because I had already loaded the tidyverse in this session.)

(c) Put the URL of the data file in a variable called my_url. Then use read_delim to read in the file. (See solutions for how.) read_delim reads data files where the data values are always separated by the same single character, here a space. Save the data frame in a variable rats.

Like this:

my_url <- "http://ritsokiguess.site/datafiles/jumping.txt"
rats <- read_delim(my_url," ")
Rows: 30 Columns: 2
── Column specification ────────────────────────────────────────────────────────
Delimiter: " "
chr (1): group
dbl (1): density

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

The second thing in read_delim is the thing that separates the data values. Often when you use read_delim it’ll be a space.
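It can be clearer to name the second input, and the message about column types can be switched off in the way the message itself suggests. The sketch below uses a tiny made-up data set typed in directly, so that it runs without the web. The values are invented (they are not the rat data), and passing literal text via I() assumes readr version 2.0 or later:

```r
library(tidyverse)

# made-up values laid out like the real file: a header line, then one space between values
toy_text <- I("group density\nA 600\nB 650\n")

# delim says what separates the values; show_col_types = FALSE quiets the message
toy <- read_delim(toy_text, delim = " ", show_col_types = FALSE)
toy
```

For the real file, the equivalent call would be read_delim(my_url, delim = " ").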

Hint: to get the file name into my_url, the best way is to right-click on the link, and select Copy Link Address (or equivalent in your browser). That puts it on your clipboard. Then make a code chunk and put this in it (you’ll probably only need to type one quote symbol, because R Studio will supply the other one):

my_url <- ""

then put the cursor between the two quote symbols and paste. This is better than selecting the URL in my text and then copy-pasting that because odd things happen if it happens to span two lines on your screen. (URLs tend to be rather long, so this is not impossible.)

(d) Take a look at your data frame, by making a new code chunk and putting the data frame’s name in it (as we did with mtcars).

rats

There are 30 rows and two columns, as there should be.

(e) Find the mean bone density for rats that did each amount of jumping.

This is something you’ll see a lot: group_by followed by summarize. Reminder: to get that funny thing with the percent signs (called the “pipe symbol”), type control-shift-M (or equivalent on a Mac):

rats %>% group_by(group) %>%
  summarize(m = mean(density))

The mean bone density is clearly highest for the high jumping group, and not much different between the low-jumping and control groups.
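summarize can compute several summaries at once. Here is a sketch using the built-in mtcars (so that it runs without reading anything from the web), grouping by number of cylinders; the summary names n, m and s and the variable name mpg_by_cyl are my own choices:

```r
library(tidyverse)

# one row per value of cyl, with the number of cars and the mean and SD of mpg in each group
mpg_by_cyl <- mtcars %>%
  group_by(cyl) %>%
  summarize(n = n(), m = mean(mpg), s = sd(mpg))
mpg_by_cyl
```

The same pattern applied to the rats would be rats %>% group_by(group) %>% summarize(n = n(), m = mean(density), s = sd(density)).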

(f) Make a boxplot of bone density for each jumping group.

On a boxplot, the groups go across and the values go up and down, so the right syntax is this:

ggplot(rats, aes(x=group, y=density)) + geom_boxplot()

Given the amount of variability, the control and low-jump groups are very similar (with the control group having a couple of outliers), but the high-jump group seems to have a consistently higher bone density than the others.

This is more or less in line with what the experimenters were guessing, but it seems that it has to be high jumping to make a difference.

You might recognize that this is the kind of data where we would use analysis of variance, which we will do later on in the course: we are comparing several (here three) groups.

4 If you want more practice, work through question 3.4 of PASIAS.

Footnotes

  1. Especially if you are on your own computer.

  2. This document was actually produced by literally running this code, a process known as “rendering”, which we will learn about shortly.

  3. Or possibly in the Viewer tab of R Studio, depending on how things are set up.

  4. A big part of coding is dealing with errors. You will forget things, and it is fine. In the same way that it doesn’t matter how many times you get knocked down, it’s key that you get up again each time: it doesn’t matter how many errors you made, it’s key that you fix them. If you want something to sing along with while you do this, I recommend this.

  5. The reason this happens is that your graphs are saved as separate image files within the same folder as the .qmd file. When you look at the output .html file, it will be in the same folder as those, which means that it will find all the images and hence display all the graphs. As soon as you move the .html file, for example to hand it in on Quercus, the image files don’t go with it, and when you look at the downloaded version of what you uploaded to Quercus, the graphs will all be missing. If this happens to you, you can catch it by looking at the file you handed in, seeing the problem, and fixing it before it’s too late. The embed-resources puts all the images directly in the .html file, so that they will go along with it no matter where you move it to.