Assignment 0
(This is the same thing as Worksheet 1.)
The questions here have solutions attached. Follow the solutions to see what to do, if you cannot otherwise guess.
It is very much worth your while to work through these problems, and the ones in future tutorial worksheets, because they will get you used to how R operates, and gain you some comfort in coding things. If you do not work through these problems now, any issues that you could have dealt with this week (with help available) will come back to bite you later, when you will have an assignment due. This is stress you would do well to be without.
If you don’t get to the end in tutorial, it’s a good idea to finish them on your own time this week, maybe after Thursday’s lecture in the case of the last question.
Using R Studio online
- Point your web browser at http://r.datatools.utoronto.ca. Click on the blue “Log In” button below “R Studio”. You might see a “CILogon” screen. If you do, make sure that it says University of Toronto near the bottom and click the green Log On. Log in with your UTorID and password, then wait for R Studio to start up.
This is about what you should see first, before you click Log In:
Click Log In. If you are not logged into your UTorID, you should then see this:
Click Log On, and log in with your UTorID and password. You will see a progress bar as things start up, and then you should see something like this:
This is R Studio, ready to go. (Sometimes you’ll see a rather forbidding Error message; if you do, click OK and carry on.)
If you are already logged in to something else on the same browser that uses your UTorID and password, you may come straight here without needing to log in again.
- Create a new Project for your work in this course. A good name for the project is the course code. See below for how to do this.
Select File and New Project to get this:
Click on New Directory (highlighted blue on mine). This will create a new folder to put your new project in, which is usually what you want to do. The idea is that a project is a container for a larger collection of work, such as all your assignments in this course. That brings you to this:
where you click on New Project (highlighted on mine), and:
Give your project a name, as I did. Then click Create Project. At this point, R Studio will be restarted in your new project. You can tell which project you are in by looking top right, and you’ll see the name of your project next to the R symbol:
- One last piece of testing: find the Console window (which is probably on the left). Click next to the blue >, type the line below, and press Enter:
library(tidyverse)
It may think a bit, and then you’ll see something like this:
You may have noticed (especially if you pause while typing) that R Studio will offer you some suggestions. If what you want is listed among the suggestions, you can select it by using the up and down arrow if needed, then press Enter to complete what you are typing. (If the suggestion is not anything like what you wanted, you can make it go away by pressing the Esc key, and then carry on typing.)
Aside: I used to use a cloud R Studio called rstudio.cloud. If you see or hear any references to that, it means the same thing as R Studio on r.datatools or jupyter. (You can still use rstudio.cloud if you want; it used to be completely free, but now the free tier won’t last you very long; the utoronto.ca link is free as long as you are at U of T.) I’m trying to get rid of references to R Studio Cloud as I see them, but I am bound to miss some, and in the lecture videos they are rather hard to find. This is another reason not to rely too much on the videos.
Now we can get down to some actual work.
Getting started
This question is to get you started using R.
- Start R Studio on r.datatools (or on your computer), in your course project that you created in the previous question.
If you just finished the previous question, you have nothing to do. If you shut down R Studio in between, start it up again. It often likes to put you back in the last project you were in (in which case you have nothing more to do), but if it doesn’t, look in the File menu and either Open Project or see if you can find your course project in Recent Projects.
You ought to see something like this:
There should be one thing on the left half, and at the top right it’ll say “Environment is empty”.
Extra: if you want to tweak things, select Tools (at the top of the screen) and from it Global Options, then click Appearance. You can make the text bigger or smaller via Editor Font Size, and choose a different colour scheme by picking one of the Editor Themes (which previews on the right). My favourite is Tomorrow Night Blue. Click Apply or OK when you have found something you like. (I spend a lot of time in R Studio, and I like having a dark background to be easier on my eyes, but I use a black-on-white theme in lecture so that you can see most easily what I am doing.)
- We’re going to do some stuff in R here, just to get used to it. First, make a Quarto document by selecting File, New File and Quarto Document.
In the first box that pops up, you’ll be invited to give your document a title. Make something up for now. Down at the bottom, it says “Use Markdown Visual Editor” with a box to the left. Check that box.
The first time, you might be invited to “install some packages” to make the document thing work.1 Let it do that by clicking Yes. After that, you’ll have this:
A couple of technical notes:
this should be in the top left pane of your R Studio now, with the Console below it.
At the top of the file, between the two lines with three hyphens (minus signs, whatever), is some information about the document, known in the jargon as a YAML block, any of which you can change:
the title is whatever title you gave your document
the format is what the output is going to be (in this case, HTML like a webpage, which is mostly what we’ll be using)
you should be in a visual editor that looks like Notion or a bit like a Google doc (the default). There is also a Source editor which gives you more control, and shows that underlying the document is a thing called R Markdown (which is a code for writing documents). Switch between them by clicking on Source or Visual as appropriate at the top left of your document.
My document is called “My awesome title”, but the file in which the document lives is still untitled because I haven’t saved it yet. See right at the top.
- You can delete the template code below the YAML block now (that is, everything from the title “Quarto” to the end). Somewhere in the space opened up below the YAML block (it might say “Heading 2”, greyed out), type a /. This, like Notion, gives you a list of things to choose from to insert there. Pressing Enter will insert a “code chunk”, sometimes known as a “code cell”. We are going to use this in a moment.
Something like this:
The {r} at the top of the code chunk means that the code that will go in there will be R code (you can also have a Python code chunk, among others).
- On the line below the {r}, type these two lines of code into the chunk in the Quarto document:
library(tidyverse)
mtcars
What this will do: get hold of a built-in data set with information about some different models of car, and display it.
In approximately five seconds, you’ll be demonstrating that for yourself.
- Run this command. To do that, look at the top right of your code chunk block (shaded in a slightly different colour). You should see a down arrow and a green “play button”. Click the play button. This will run the code, and show the output below the code chunk.
Here’s what I get (yours should be the same):
This is a rectangular array of rows and columns, with individuals (here, cars) in rows and variables in columns, known as a “dataframe”. When you display a dataframe in a Quarto document, you see 10 rows and as many columns as will fit on the screen. At the bottom, it says how many rows and columns there are altogether (here 32 rows and 11 columns), and which ones are being displayed.
You can see more rows by clicking on Next, and if there are more columns, as there are here, you’ll see a little arrow next to the rightmost column (as here next to am) that you can click on to see more columns. Try it and see. Or if you want to go to a particular collection of rows, click one of the numbers between Previous and Next: 1 is rows 1–10, 2 is rows 11–20, and so on.
The column on the left without a header (containing the names of the cars) is called “row names”. These have a funny kind of status, kind of a column and kind of not a column; usually, if we need to use the names, we have to put them in a column first.
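To make the point about row names concrete, here is a minimal sketch (not part of the worksheet's own solutions) that checks the size of the data frame and then moves the car names into an ordinary column. The names mtcars_named and model are my own choices for illustration; rownames_to_column() comes from the tibble package, which is loaded along with the tidyverse.

```r
library(tidyverse)

# How many rows and columns are there altogether?
dim(mtcars)

# Move the row names into a real column.
# "model" is a made-up column name chosen for this sketch.
mtcars_named <- mtcars %>% rownames_to_column("model")

# The car names are now a column like any other, so we can select them.
mtcars_named %>% select(model, mpg) %>% slice(1:3)
```

After this, mtcars_named has one more column than mtcars, and the names can be used like any other variable.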
In future solutions, rather than showing you a screenshot, expect me to show you something like this:
The top bit is the code, the bottom bit the output. In this kind of display, you only see the first ten rows (by default).2
If you don’t see the “play button”, make sure that what you have really is a code chunk. If you can’t figure it out, delete this code chunk and make a new one. Sometimes R Studio gets confused.
On the code chunk, the other symbols are the settings for this chunk (you have the choice to display or not display the code or the output or to not actually run the code). The second one, the down arrow, runs all the chunks prior to this one (but not this one).
Your output has its own little buttons (as seen on the screenshot). The first one pops the output out into its own window; the second one shows or hides the output, and the third one deletes the output (so that you have to run the chunk again to get it back). Experiment. You can’t do much damage here.
- Something a little more interesting: summary obtains a summary of whatever you feed it (the five-number summary plus the mean for numerical variables). Obtain this for our data frame. To do this, create a new code chunk below the previous one, type summary(mtcars) into the code chunk, and run it.
This is what you should see:
or the other way:
summary(mtcars)
      mpg             cyl             disp             hp
 Min.   :10.40   Min.   :4.000   Min.   : 71.1   Min.   : 52.0
 1st Qu.:15.43   1st Qu.:4.000   1st Qu.:120.8   1st Qu.: 96.5
 Median :19.20   Median :6.000   Median :196.3   Median :123.0
 Mean   :20.09   Mean   :6.188   Mean   :230.7   Mean   :146.7
 3rd Qu.:22.80   3rd Qu.:8.000   3rd Qu.:326.0   3rd Qu.:180.0
 Max.   :33.90   Max.   :8.000   Max.   :472.0   Max.   :335.0
      drat             wt             qsec             vs
 Min.   :2.760   Min.   :1.513   Min.   :14.50   Min.   :0.0000
 1st Qu.:3.080   1st Qu.:2.581   1st Qu.:16.89   1st Qu.:0.0000
 Median :3.695   Median :3.325   Median :17.71   Median :0.0000
 Mean   :3.597   Mean   :3.217   Mean   :17.85   Mean   :0.4375
 3rd Qu.:3.920   3rd Qu.:3.610   3rd Qu.:18.90   3rd Qu.:1.0000
 Max.   :4.930   Max.   :5.424   Max.   :22.90   Max.   :1.0000
       am              gear            carb
 Min.   :0.0000   Min.   :3.000   Min.   :1.000
 1st Qu.:0.0000   1st Qu.:3.000   1st Qu.:2.000
 Median :0.0000   Median :4.000   Median :2.000
 Mean   :0.4062   Mean   :3.688   Mean   :2.812
 3rd Qu.:1.0000   3rd Qu.:4.000   3rd Qu.:4.000
 Max.   :1.0000   Max.   :5.000   Max.   :8.000
For the gas mileage column mpg, the mean is bigger than the median, and the largest value is unusually large compared with the others, suggesting a distribution that is skewed to the right.
There are 11 numeric (quantitative) variables, so we get the five-number summary plus mean for each one. Categorical variables, if we had any here, would be displayed a different way.
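If you want to check these two remarks directly, something like the sketch below may help. It uses only base R; the $ picks one column out of a data frame. Treating cyl as categorical is an assumption I am making purely for illustration (in mtcars it is stored as a number), and cyl_counts is a made-up name.

```r
# Compare the mean and median of gas mileage: a mean above the median
# is one sign of a right-skewed distribution.
mean(mtcars$mpg)
median(mtcars$mpg)
mean(mtcars$mpg) > median(mtcars$mpg)

# cyl is stored as a number, but it only takes a few distinct values.
# Converting it to a factor (for illustration only) makes summary()
# show a count for each value instead of a five-number summary.
cyl_counts <- summary(factor(mtcars$cyl))
cyl_counts
```

The second summary is a table of counts, one per category, which is the "different way" that categorical variables get displayed.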
- Let’s make a histogram of the gas mileage data. Type the code below into another new code chunk, and run it:
ggplot(mtcars, aes(x = mpg)) + geom_histogram(bins = 8)
The code looks a bit wordy, but we’ll see what all those pieces do later in the course (like, maybe tomorrow).
Solution
This is what you should see:
ggplot(mtcars, aes(x = mpg)) + geom_histogram(bins = 8)
The long right tail supports our guess from before that the distribution is right-skewed.
- Some aesthetics: Add some narrative text above and below your code chunks. Above the code chunk is where you say what you are going to do (and maybe why you are doing it), and below is where you say what you conclude from the output you just obtained. I find it looks better if you have a blank line above and below each code chunk.
This is what I wrote (screenshot), with none of the code run yet. My library(tidyverse) line seems to have disappeared, but yours should still be there:
- Save your Quarto document (the usual way with File and Save). This saves it on the jupyter servers (and not on your computer). This means that when you come back to it later, even from another device, this notebook will still be available to you. (This of course does not apply if you are running R Studio on your own computer.) Now click Render. This produces a pretty HTML version of your Quarto document. This will appear in a new tab of your web browser,3 which you might need to encourage to appear (if you have a pop-up blocker) by clicking a Try Again.
If there are any errors in the rendering process, these will appear in the Render tab. The error message will tell you where in your document your error was. Find it and correct it.4 Otherwise, you should see your document.
- The rendering process as you did it doesn’t produce that nice display of a dataframe that I had in one of my screenshots. To get that, alter the YAML block (at the very top) to read as below. Re-render, and note what it does.
format:
  html:
    df-print: paged
    embed-resources: true
You should keep anything else you had there before (such as a title), but rearrange the format-html part to look like this.
Note now that anyone reading your document can actually page through the dataframes you display in the same way that you did, to check that they contain the right things.
You should have this in the YAML block at the top of each assignment you do, so that the grader can check that your dataframes are what you say they are, and for another reason that I explain below.
Extra 1: you might prefer to have a preview of your document within R Studio. To make this happen if it doesn’t by itself, look for the gear wheel to the right of Render. Click the arrow beside it, and in the drop-down, click on Preview in Viewer Pane. Render again, and you’ll see the rendered version of your document in a Viewer pane on the right. This puts the thing you’re writing and what it will look like side by side.
Extra 2: you might be annoyed at having to remember to save things. If you are, you can enable auto-saving. To do this, go to Tools and select Global Options. Select Code (on the left) and Saving (at the top). Click on Automatically Save when editor loses focus, to put a check mark in the box on the left of it. Change the pull-down below that to Save and Write Changes. Click OK. Now, as soon as you pause for a couple of seconds, everything unsaved will be saved. This does not apply to “untitled” documents, since R Studio has no idea where to save them until you give the file a name. (This is a good reason to give your files names and save them once as soon as possible.)
- Practice handing in your rendered Quarto document, as if it were an assignment that was worth something. (It is good to get the practice in a low-stakes situation, so that you’ll know what to do next week.)
See below for what to do if you are running R Studio on your computer, rather than on r.datatools.
There are two steps: download the HTML file onto your computer, and then hand it in on Quercus. To download: find the HTML file that you want to download in the Files pane on the right. You might need to click on Files at the top, especially if you had a Viewer open there before:
I called my Quarto document awesome and the file I was working on was called awesome.qmd (the extension stands for “Quarto Markdown”). That’s the file I had to render to produce the output. My output file itself is called awesome.html. That’s the file I want to hand in. If you called your file something different when you saved it, that’s the thing to look for: there should be something ending in .qmd and something with the same first part ending in .html.
Click the checkbox to the left of the HTML file. Now click on More above the bottom-right pane. This pops up a menu from which you choose Export. This will pop up another window called Export Files, where you put the name that the file will have on your computer. (I usually leave the name the same.) Click Download. The file will go to your Downloads folder, or wherever things you download off the web go.
If you are working on your computer, you will have created a new folder when you created your project, and the html file you made by rendering will be in that folder. Thus, you can hand it in (below) directly without having to download it first.
Now, to hand it in. Open up Quercus at q.utoronto.ca, log in and navigate to this course. Click Assignments. Click (the title of) Assignment 0. There is a big blue Start Assignment button top right. Click it. You’ll get a File Upload at the bottom of the screen. Click Choose File and find the HTML file that you downloaded. Click Open (or equivalent on your system). The name of the file should appear next to Choose File. Click Submit Assignment. You’ll see Submitted at the top right, and below that is a Submission Details window and the file you uploaded.
You should be in the habit of always checking what you hand in, by downloading it again, to a different folder (like Downloads) and looking at it to make sure it’s what you thought you had handed in. One reason for doing so is that if you don’t have the line embed-resources: true in your YAML header, you will find that the file you handed in has lost all its graphs,5 which is of course a big problem on an assignment, and will cost you a bunch of marks if you do it (so make sure that the embed-resources line is there).
If you want to try this again, you can try again as many times as you like, by making a New Attempt. (For the real thing, you can use this if you realize you made a mistake in something you submitted. The graders’ instructions, for the real thing, are to grade the last file submitted, so in that case you need to make sure that the last thing submitted before the due date includes everything that you want graded. My assignments have unlimited attempts, so you don’t have to ask me for another one.)
- Something more ambitious: make a scatterplot of gas mileage, mpg, on the \(y\)-axis, against horsepower, hp, on the \(x\)-axis.
That goes like this. I’ll explain the steps below.
library(tidyverse)
ggplot(mtcars, aes(x=hp, y=mpg)) + geom_point()
This shows a somewhat downward trend, which is what you’d expect, since a larger hp value means a more powerful engine, which will probably consume more gas and get fewer miles per gallon. As for the code: to make a ggplot plot, as we will shortly see in class, you first need a ggplot statement that says what to plot. The first thing in a ggplot is a data frame (mtcars here), and then the aes says that the plot will have hp on the \(x\)-axis and mpg on the \(y\)-axis, taken from the data frame that you specified. That’s all of the what-to-plot. The last thing is how to plot it; geom_point() says to plot the data values as points.
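One way to see the split between what-to-plot and how-to-plot is to store the first part in a variable and add the second part afterwards. This is only a sketch to illustrate the idea; the variable name what is my own, and you would not normally need to do it in two steps.

```r
library(tidyverse)

# What to plot: the data frame, and which columns go on which axis.
what <- ggplot(mtcars, aes(x = hp, y = mpg))

# How to plot it: add a geom. Displaying the result draws the scatterplot.
what + geom_point()
```

Displaying what on its own gives a set of axes with nothing drawn on them, because no geom has yet said how the data should appear.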
You might like to add a regression line to the plot. That is a matter of adding this to the end of the plotting command:
ggplot(mtcars, aes(x=hp, y=mpg)) + geom_point() + geom_smooth(method="lm")
`geom_smooth()` using formula = 'y ~ x'
The line definitely goes downhill. Decide for yourself how well you think a line fits these data.
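If you would like some numbers to go with your visual impression, a possible sketch is below. It fits the same kind of straight line with lm() from base R and also computes the correlation; the name fit is my own, and none of this is required for the worksheet.

```r
# Fit a straight line for mpg in terms of hp, the same kind of line
# that geom_smooth(method = "lm") draws on the plot.
fit <- lm(mpg ~ hp, data = mtcars)

# Intercept and slope; a negative slope means the line goes downhill.
coef(fit)

# Correlation between horsepower and gas mileage (between -1 and 1).
cor(mtcars$hp, mtcars$mpg)
```

A slope below zero matches the downhill line on the plot, but these numbers do not say whether a straight line is the right shape, so looking at the points is still worthwhile.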
Reading data from a file
In this question, we read a file from the web and do some descriptive statistics and a graph. This is very like what you will be doing on future assignments, so it’s good to practice it now.
Take a look at the data file at http://ritsokiguess.site/datafiles/jumping.txt. These are measurements on 30 rats that were randomly made to do different amounts of jumping by group (we’ll see the details later in the course). The control group did no jumping, and the other groups did “low jumping” and “high jumping”. The first column says which jumping group each rat was in, and the second is the rat’s bone density (the experimenters’ supposition was that more jumping should go with higher bone density).
- What are the two columns of data separated by? (The fancy word is “delimited”).
Exactly one space. This is true all the way down, as you can check.
- Make a new Quarto document. Leave the YAML block, but get rid of the rest of the template document. Start with a code chunk containing library(tidyverse). Run it.
You will get either the same message as before or nothing. (I got nothing because I had already loaded the tidyverse in this session.)
- Put the URL of the data file in a variable called my_url. Then use read_delim to read in the file. (See solutions for how.) read_delim reads data files where the data values are always separated by the same single character, here a space. Save the data frame in a variable rats.
Like this:
my_url <- "http://ritsokiguess.site/datafiles/jumping.txt"
rats <- read_delim(my_url, " ")
Rows: 30 Columns: 2
── Column specification ────────────────────────────────────────────────────────
Delimiter: " "
chr (1): group
dbl (1): density
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
The second thing in read_delim is the thing that separates the data values. Often when you use read_delim it’ll be a space.
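To try read_delim without relying on the web, here is a small self-contained sketch. It writes a tiny file in the same layout (a header line, then values separated by exactly one space) and reads it back. The rows are invented for illustration and are not the real rat data; toy_lines, toy_file and toy are made-up names.

```r
library(tidyverse)

# Made-up rows in the same layout as the real file: a header line,
# then values separated by exactly one space.
# These numbers are invented and are NOT the real rat data.
toy_lines <- c("group density", "Control 600", "Lowjump 610", "Highjump 640")
toy_file <- tempfile(fileext = ".txt")
writeLines(toy_lines, toy_file)

# The second input is the delimiter. show_col_types = FALSE quiets the
# column-specification message, as that message itself suggests.
toy <- read_delim(toy_file, " ", show_col_types = FALSE)
toy
```

The text column is read as chr and the number column as dbl, the same types that the column specification reports for the real file.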
Hint: to get the file name into my_url, the best way is to right-click on the link, and select Copy Link Address (or equivalent in your browser). That puts it on your clipboard. Then make a code chunk and put this in it (you’ll probably only need to type one quote symbol, because R Studio will supply the other one):
my_url <- ""
then put the cursor between the two quote symbols and paste. This is better than selecting the URL in my text and then copy-pasting that because odd things happen if it happens to span two lines on your screen. (URLs tend to be rather long, so this is not impossible.)
- Take a look at your data frame, by making a new code chunk and putting the data frame’s name in it (as we did with mtcars).
rats
There are 30 rows and two columns, as there should be.
- Find the mean bone density for rats that did each amount of jumping.
This is something you’ll see a lot: group_by followed by summarize. Reminder: to get that funny thing with the percent signs (called the “pipe symbol”), type control-shift-M (or equivalent on a Mac):
rats %>% group_by(group) %>%
  summarize(m = mean(density))
The mean bone density is clearly highest for the high jumping group, and not much different between the low-jumping and control groups.
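The same group_by-then-summarize pattern works on any data frame. As a sketch you can run without downloading anything, here it is on the built-in mtcars, finding the mean gas mileage for each number of cylinders. The names mpg_by_cyl, n and mean_mpg are my own choices.

```r
library(tidyverse)

# Same pattern on a built-in data frame: mean gas mileage for each
# number of cylinders, plus how many cars are in each group.
mpg_by_cyl <- mtcars %>%
  group_by(cyl) %>%
  summarize(n = n(), mean_mpg = mean(mpg))
mpg_by_cyl
```

You get one row per group, and including n() is a handy check that every observation ended up in some group.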
- Make a boxplot of bone density for each jumping group.
On a boxplot, the groups go across and the values go up and down, so the right syntax is this:
ggplot(rats, aes(x=group, y=density)) + geom_boxplot()
Given the amount of variability, the control and low-jump groups are very similar (with the control group having a couple of outliers), but the high-jump group seems to have a consistently higher bone density than the others.
This is more or less in line with what the experimenters were guessing, but it seems that it has to be high jumping to make a difference.
You might recognize that this is the kind of data where we would use analysis of variance, which we will do later on in the course: we are comparing several (here three) groups.
Footnotes
Especially if you are on your own computer.↩︎
This document was actually produced by literally running this code, a process known as “rendering”, which we will learn about shortly.↩︎
Or possibly in the Viewer tab of R Studio, depending on how things are set up.↩︎
A big part of coding is dealing with errors. You will forget things, and it is fine. In the same way that it doesn’t matter how many times you get knocked down, it’s key that you get up again each time: it doesn’t matter how many errors you made, it’s key that you fix them. If you want something to sing along with while you do this, I recommend this.↩︎
The reason this happens is that your graphs are saved as separate image files within the same folder as the .qmd file. When you look at the output .html file, it will be in the same folder as those, which means that it will find all the images and hence display all the graphs. As soon as you move the .html file, for example to hand it in on Quercus, the image files don’t go with it, and when you look at the downloaded version of what you uploaded to Quercus, the graphs will all be missing. If this happens to you, you can catch it by looking at the file you handed in, seeing the problem, and fixing it before it’s too late. The embed-resources line puts all the images directly in the .html file, so that they will go along with it no matter where you move it to.↩︎