STAC32 Assignment 1

You are expected to complete this assignment on your own: that is, you may discuss general ideas with others, but the writeup of the work must be entirely your own. If your assignment is unreasonably similar to that of another student, you can expect to be asked to explain yourself.

If you run into problems on this assignment, it is up to you to figure out what to do. The only exception is if it is impossible for you to complete this assignment, for example a data file cannot be read. (There is a difference between you not knowing how to do something, which you have to figure out, and something being impossible, which you are allowed to contact me about.)

You must hand in a rendered document that shows your code, the output that the code produces, and your answers to the questions. This should be a file with .html on the end of its name. There is no credit for handing in your unrendered document (ending in .qmd), because the grader cannot then see whether the code in it runs properly. After you have handed in your file, you should be able to see (in Attempts) what file you handed in, and you should make a habit of checking that you did indeed hand in what you intended to, and that it displays as you expect.

Hint: render your document frequently, and solve any problems as they come up, rather than trying to do so at the end (when you may be close to the due date). If your document will not successfully render, it is because of an error in your code that you will have to find and fix. The error message will tell you where the problem is, but it is up to you to sort out what the problem is.

Walking the dogs

The author of a statistics textbook owns dogs, and likes to walk (sometimes with the dogs, sometimes without). The author carries a simple phone app that measures the following quantities, except for Walk that was added by the author:

  • StepCount: Number of steps taken in the day
  • Kcal: Calories burned (according to pedometer)
  • Miles: Total distance walked (in miles)
  • Weather: classified as cold, rain, or shine (sunny).
  • Day: Day of week (as you would guess, except R=Thursday, U=Sunday)
  • Walk: Were the dogs walked? (1=yes or 0=no)
  • Steps: Steps in units of 1,000 (so StepCount/1000)

The data are in http://ritsokiguess.site/datafiles/Stat2Data_WalkTheDogs.csv.

  1. (3 points) Read (into R) the datafile and display at least some of the dataframe. (Here, and elsewhere in the course, “display at least some of the dataframe” means to display at least 10 rows and as many columns as will fit in your display.)

This is a CSV file (from the filename extension), so read_csv will read it in directly from the URL.1 I like to put the (usually long) URL in a variable first. The safe way to copy-paste the URL is to right-click it and then select Copy; if you select it first, you risk getting tripped up by URLs that span two lines on your screen, which might gain an extra line-break character that you don’t want.

Hence:

my_url <- 'http://ritsokiguess.site/datafiles/Stat2Data_WalkTheDogs.csv'
dogs <- read_csv(my_url)  
Rows: 223 Columns: 7
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (2): Weather, Day
dbl (5): StepCount, Kcal, Miles, Walk, Steps

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
dogs

Give the dataframe a name that says something about what is in it, and does not duplicate any of the column names. Hence my dogs, or you could use something like walking. Two rather gimme points for reading in the dataframe successfully and giving it a good name (minus a half point if you use a name like my_data that does not say what it contains). One point for displaying it; just typing the name is enough. If you want, you can do something like this:

glimpse(dogs)
Rows: 223
Columns: 7
$ StepCount <dbl> 2615, 3323, 2721, 2454, 5528, 3257, 4988, 4497, 4567, 2569, …
$ Kcal      <dbl> 8, 12, 13, 12, 152, 17, 65, 133, 35, 19, 273, 353, 140, 4, 2…
$ Miles     <dbl> 1.4, 1.8, 1.4, 1.3, 3.1, 1.8, 2.7, 2.4, 2.6, 1.3, 5.1, 6.3, …
$ Weather   <chr> "shine", "shine", "shine", "shine", "cold", "shine", "shine"…
$ Day       <chr> "F", "S", "U", "M", "T", "W", "R", "F", "S", "U", "M", "T", …
$ Walk      <dbl> 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ Steps     <dbl> 2.615, 3.323, 2.721, 2.454, 5.528, 3.257, 4.988, 4.497, 4.56…

which shows at least some of the data values; that is also good. The point in displaying what you have read in is to convince yourself (and, of course, the grader) that the data you read in is at least plausible (for example, numbers where there should be numbers, text where there should be text, and the right kind of values).

  1. (3 points) Make a graph that shows how many times each type of weather was observed. Which type of weather was most common? Explain briefly.

This is the column Weather (note the uppercase W). It is a categorical variable, so the right graph is a bar chart:

ggplot(dogs, aes(x = Weather)) + geom_bar()

Two points for the graph, rather generous, but it’s your first ggplot graph, so I want to offer some encouragement.

The other point is for the explanation. The tallest bar is for shine, so sunny weather was the most common. (Have an answer in proper English that your reader can understand, and a reason for it. “The most common weather was shine” is not proper English.)

  1. (2 points) Make a graph that shows the distribution of distance walked each day.

One quantitative variable, Miles, so a histogram:

ggplot(dogs, aes(x = Miles)) + geom_histogram(bins = 8)

Pick a sensible number of bins. You want a number that shows a nice smooth shape; too few bins will hide the shape of the distribution:

ggplot(dogs, aes(x = Miles)) + geom_histogram(bins = 4)

and too many will be too jagged to show the overall picture:

ggplot(dogs, aes(x = Miles)) + geom_histogram(bins = 15)

Here, the up-and-down pattern is not a feature of the distribution, but just chance, and you don’t want to confuse your reader.

For me, a number of bins between 6 and 12 inclusive shows the shape clearly: a unimodal (one peak) distribution that is somewhat right-skewed with an upper outlier.

  1. (3 points) Does the person walk further on average on a day when they walked the dogs, as compared to when they did not? Make a suitable graph that will help you answer this question. What do you conclude? (Hint: for the column that indicates whether they walked the dogs, should this be quantitative or categorical? What will R treat this as? Your first attempt at a graph may not come out right. Read the message carefully to figure out what to do.)

This uses the quantitative variable Miles and the variable Walk that is actually categorical (either they do walk the dogs or they don’t), and hence a boxplot. This was my first attempt:

ggplot(dogs, aes(x = Walk, y = Miles)) + geom_boxplot()
Warning: Continuous x aesthetic
ℹ did you forget `aes(group = ...)`?

This is where you need to stop and think. The boxplot should have two boxes: one for days when the dogs were walked, and one for days when they were not. Thus Walk should be categorical, but it is actually a number (0 or 1), so R thinks it is quantitative. How do we change R’s mind? The message says “Warning: Continuous x aesthetic: did you forget aes(group = ...)?” So ggplot thinks that Walk (the x) is actually something on a continuous scale, where anything between 0 and 1 is possible, as indicated by the values on the \(x\)-axis of my graph. To fix it, use the hint in the second part of the message,2 and add a group = to the aes, with the thing on the right side of the equals being the thing that defines the groups, namely Walk:

ggplot(dogs, aes(x = Walk, y = Miles, group = Walk)) + geom_boxplot()

It looks a bit odd to have Walk appear twice, but this is how to make it work.

Two points for getting to that. Only one (which is generous) for getting to my first boxplot and stopping there, because you need to think about whether it looks like the right thing; ten seconds of thinking will convince you that it does not and you need to fix it up somehow. That’s why the Hint is there.

Going back to the description of the data, a Walk value of 1 means that the author did walk the dogs. The median line across the right-hand boxplot is slightly higher, so the median distance walked by the author on days when they walked the dogs is slightly greater than on days when they did not.

You need to have adverb like “slightly” or “a little” for the best answer because the difference is not very big, relative to how much variability there is (a fair bit, especially on days when the author did not walk the dogs). Another way to say this is that the author fairly often walked further on days with no dog walk than on days when they did walk the dogs.

Extra: I don’t think you’ve seen this approach yet in the course (you will), but there is also a way to force R to treat an apparently-quantitative variable as categorical, which is to wrap it in factor():

ggplot(dogs, aes(x = factor(Walk), y = Miles)) + geom_boxplot()

That gets more or less the same graph as the one with group =. Note the difference on the \(x\)-axis, though: here, Walk is treated as a categorical value with the two categories 0 and 1 (that is to say, ignore the fact that they are actually numbers). On the group = graph above, Walk is treated as a variable for which a value like 0.5 actually makes sense, but a separate box is produced for each distinct value of Walk, whatever they happen to be.

  1. (2 points) Make a graph that shows how many times the author walked on each day of the week (without considering any of the other variables). What is wrong with your graph? Explain briefly. (If your graph does not have anything wrong with it, explain briefly how you fixed the problem you encountered.)

A bar chart again:

ggplot(dogs, aes(x = Day)) + geom_bar()

Only one point for the graph this time.

What’s wrong with it? Go back and look at the description of the days of the week. The days are in the wrong order! Specifically, they are in alphabetical order, rather than day-of-week order. The other point.

Extra: R has no way to know what the “right” order is unless you tell it in some way. Its default ordering for categories is to put them in alphabetical order, which is fine for “nominal” categorical variables where the order doesn’t matter. But day of week is “ordinal”: the order has meaning. We can fix this here by going back and looking at the data: the author recorded their walking stats every day in order, so the days of the week are in a sensible order in the data. We can take advantage of this by using fct_inorder thus:

ggplot(dogs, aes(x = fct_inorder(Day))) + geom_bar()

This says “order the days by the order you find them in the data”. The author started recording on a Friday, so the week here starts on a Friday (which is a bit odd), but the days are otherwise in a sensible order. This graph shows that the author’s walking tends to increase through the week to a maximum on Saturday, and then Sunday is a minimum.

  1. (3 points) Is there a relationship between the number of steps taken in a day (explanatory) and the distance travelled in that day (response)? Make a graph to investigate. What do you learn from your graph? Based on what you can find out about pedometer apps, does your graph make sense? Explain briefly. (Cite a source for your answer.)

Three things, a point each. First, the graph. With two quantitative variables StepCount and Miles, the right graph is a scatterplot, with (as I said in the question) Miles as the \(y\):

ggplot(dogs, aes(x = StepCount, Miles)) + geom_point()

You could use Steps instead of StepCount; the graph should be the same except for a change of scale on the \(x\)-axis.

Second, note that this is (except for one point) an apparently perfect linear relationship (words like “very strong” or “extremely strong” are also good, but you need to get at the idea that this is more than just a “strong” relationship, which could have a lot more scatter than this).

Third, figure out why this relationship is so strong. I searched for “how does a pedometer app work” (using the word I gave you in the question) and got this link: https://www.explainthatstuff.com/how-pedometers-work.html, which has a lot of detail. Old-fashioned mechanical or electronic pedometers (from the days before phones) worked by counting your steps. You had to calibrate these by walking a known distance, from which the pedometer worked out the length of your stride from the number of steps you took, and then estimated the distance you walked later as the number of steps times the calibrated stride length.

Phones contain an “accelerometer” (see https://www.makeuseof.com/how-does-my-phone-track-my-steps/) that can tell from your movement how many steps you are taking. Hence using a phone to count your steps is straightforward, and is how basic pedometer apps work.

(Something more sophisticated like Strava uses GPS to find your position on the earth’s surface every few seconds, and then based on its accurate measurements of your latitude and longitude works out how far you travelled between each time it recorded your position. This sort of app will also tell you how fast you are going. But this is not the kind of thing we are considering here.)

If you are foolish enough to use something like chatgpt for this, you will almost certainly get a source that doesn’t exist, and if you hand that in, you are guilty of plagiarism. As soon as the grader recognizes “your” text as the sort of thing that chatgpt produces, you can expect that they will check your source, and if it doesn’t exist, you will be on the wrong end of a slam-dunk academic integrity case.

The key thing here is that simple pedometer apps estimate the distance you walk by counting the number of steps and multiplying it by something (based on your stride length), so that according to these apps there is a perfect linear relationship between StepCount and Miles by definition. That’s where you need to get to for the third point.3

Extra: to find out what the linear relationship is, we can run a regression (which we see in more detail later in the course):

Miles.1 <- lm(Miles ~ StepCount, data = dogs)
summary(Miles.1)

Call:
lm(formula = Miles ~ StepCount, data = dogs)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.49115 -0.02432  0.00592  0.02727  0.10333 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) -9.800e-02  7.463e-03  -13.13   <2e-16 ***
StepCount    5.681e-04  1.148e-06  495.05   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.04931 on 221 degrees of freedom
Multiple R-squared:  0.9991,    Adjusted R-squared:  0.9991 
F-statistic: 2.451e+05 on 1 and 221 DF,  p-value: < 2.2e-16

This is as good as perfect (R-squared is very close to 1). But we said that Miles should be StepCount times something, and therefore the intercept should be exactly zero, which it is almost, but not quite. This is one of the few situations in which taking the intercept out of a regression is useful (we have a theoretical reason why the intercept should be zero), which we can do as below. In R, the intercept goes by the name 1, and a minus takes something out that was in the model:

Miles.2 <- lm(Miles ~ StepCount - 1, data = dogs)
summary(Miles.2)

Call:
lm(formula = Miles ~ StepCount - 1, data = dogs)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.57514 -0.05019 -0.01670  0.01580  0.11607 

Coefficients:
           Estimate Std. Error t value Pr(>|t|)    
StepCount 5.546e-04  6.760e-07   820.4   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.06565 on 222 degrees of freedom
Multiple R-squared:  0.9997,    Adjusted R-squared:  0.9997 
F-statistic: 6.731e+05 on 1 and 222 DF,  p-value: < 2.2e-16

so 1,000 steps is a little over half a mile, at least for the author.

Reading files

  1. (2 points) Take a look at the file at http://ritsokiguess.site/datafiles/adhd-2.txt. There are four observations on a variable y for each of 24 individuals, measured at times 0, 15, 30, and 60 minutes after a treatment. Considering that you will want to read these data into a dataframe, what is the key issue that will help you decide how to do it?

The key thing is that the observations on each line are separated by one # sign (“hash”, “pound sign”,4 whatever you think it’s called.)

Hence, read_delim will work with # as a delimiter (the character that separates one observation from the next).

  1. (2 points) Hence, read in and display (some of) the data.

You’ve done the hard thinking; now put it to work. It’s not obvious what to call the dataframe (and I’m not going to be picky if you choose a “generic” name this time), but the observations are of something called y, so we can use that as the name of the dataframe. Or, note the filename and infer that these data are something to do with ADHD,5 and use that as the name of the dataframe:

my_url <- "http://ritsokiguess.site/datafiles/adhd-2.txt"
y <- read_delim(my_url, "#")
Rows: 24 Columns: 4
── Column specification ────────────────────────────────────────────────────────
Delimiter: "#"
dbl (4): D0, D15, D30, D60

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
y

As promised. If you look in the R Console output, you’ll see that it does indeed say Delimiter: "#".

This is actually not one of those cases where read_delim can guess what the delimiter is:

y <- read_delim(my_url)
Error: Could not guess the delimiter.

Use `vroom(delim =)` to specify one explicitly.

You might think it’s pretty obvious what is separating one data value from the next, because you used your eyes in the previous part and saw what it was. The reason is this:

(this from the help for vroom, which is what read_delim uses under the hood.) Since our delimiter is not one of these more commonly used ones,6 we have to specify it explicitly.

Extra: in our data, each row is one person, but there are four observations for each person (values of y but at different times). The usual thing is to have one observation per person (which would enable you to do \(t\)-tests, ANOVA etc), but here you would expect different observations at different times on the same person to be correlated (for example, if one observation is high, you might expect the person from which it came to be a “high scorer” generally, and you’d expect other observations on the same person to be high as well). This is like matched pairs, but worse. With two observations on the same person, you can take differences and get one summary observation per person, but with four observations per person, you cannot do that, and you have to analyze the four observations together. What you have to do falls under the umbrella of “repeated measures”, which we look at in D29.

The data in http://ritsokiguess.site/datafiles/heart-rates.xlsx are a spreadsheet of heart rate values for a sample of male and female patients (as identified by the patients themselves). We want to read these data into a dataframe and make a graph.

  1. (2 points) When reading a spreadsheet into R, the file with the spreadsheet in it has to be “local”, that is, on the same machine that you are running R. R has a function download.file that takes two inputs: a URL and a name for the local version of the spreadsheet file. Run this to save a local copy of the spreadsheet. Hint: if you are running R on your own machine that is running Windows, you may need to add a third input mode = "wb" to download.file.)

This sort of thing:

my_url <- "http://ritsokiguess.site/datafiles/heart-rates.xlsx"
download.file(my_url, "heart-rates.xlsx")

I used the same name for the spreadsheet file. You’ll probably see some output that shows that it downloaded properly (for example, a number of bytes downloaded).

The reason you might need a mode = "wb" is because Windows is picky about the type of thing a file contains. It distinguishes between a file that contains only ordinary text (like the .txt and .csv files we have seen) and a file that needs to be opened in the right software (like an .xlsx or a .docx file, that need to be opened in Excel or Word in order to make any sense). Files of this last type are called “binary files”, which is the reason for the b in the mode = "wb".7 If you are running on a Mac, or on r.datatools (which actually runs on Linux), you don’t need to worry about this.

If you are working on your own computer, you should now be able to open the file you just downloaded (it’ll be in the same folder as your R project) using Excel or whatever you have, and verify that it did indeed get downloaded properly. If you are on r.datatools, you’ll have to download the spreadsheet file to your computer first (in the Files pane, More and Export, as you did for handing in Assignment 0) and then look at it on your computer.

  1. (2 points) Read the spreadsheet file that you downloaded into a dataframe, and display (some of) that dataframe. There is only one sheet in the workbook, and it is called Sheet1. The columns are called gender and heartrate.

Use read_excel from the readxl package:

library(readxl)
heart <- read_excel("heart-rates.xlsx", sheet = "Sheet1")
heart

If this gives you an error, the likely reason is that you need to install readxl (with install.packages("readxl")) first.

You’ll have to think about what to call the dataframe. Calling it by the same name as one of the columns in it is troublesome, or at the very least confusing. This is why I told you what the columns were called. Give the dataframe a name that (i) says something about what it contains, and (ii) doesn’t duplicate the name of any of the columns.

  1. (2 points) Make a suitable graph of the two columns in your dataframe.

This ought to be easy now: gender is categorical and heartrate is quantitative, so the right graph is a boxplot. This is a straightforward boxplot (unlike some others) because gender cannot be mistaken for a quantitative variable:

ggplot(heart, aes(x = gender, y = heartrate)) + geom_boxplot()

Interpretation of this is a bit messy: the median heart rates are close together (relative to the considerable amount of variability); the females’ heart rates are a lot more spread out than the males’ (compare the heights of the boxes), but the males have a low outlier that seems to be quite a lot lower than the others. So I didn’t ask you for an interpretation.

Extra: I wanted to give you something fairly easy to do with these data, and this was about the easiest thing I could think of, as a reward for your efforts at reading the data in. But, if you couldn’t figure out how to read the data in, you don’t get to reap this reward. A lot of real-life data analysis is like this: you have to go through a lot of effort to get the data into R and bash it into shape,8 but once you have done that, the actual analysis part is easy in comparison.

Footnotes

  1. If you click on the URL, it may open it in a spreadsheet program, or it may invite you to download it first and when you open it, it will open in a spreadsheet program. You don’t need to do that, but this is the hint that it’s a .csv file.↩︎

  2. Learning to read and act on messages is a key part of learning to code. If you read the message carefully, it will often tell you exactly what to do. In the old days, R’s error messages were hard to make sense of, but the Tidyverse people have put a lot of effort into making helpful messages. It is much better to read the message and figure out what it’s telling you than to jump immediately to Google or (worse) chatgpt because they don’t know what problem you are working on right now.↩︎

  3. I don’t know what happened for that one point that’s off the line. Possibly an app malfunction, or a mis-recording of the data.↩︎

  4. I was born in a country where the “pound” is the unit of currency, and this is not the symbol for it. The symbol is actually a stylized L, because the Latin word for “pound” was “libra”. The reason this symbol is sometimes called a “pound sign” (even though it is not) is that the shift-3 on a US keyboard that will get you a # will, on a UK keyboard, get you a real pound sign.↩︎

  5. I think I used these data for an assignment question some years ago, but I can’t find that question and hence I don’t have any more background than I gave you.↩︎

  6. \t is a special notation for “tab”.↩︎

  7. The w means “write”, as in: you are giving Windows the permission to write a new file, which is the downloaded copy of the spreadsheet.↩︎

  8. We have a whole section on tidying data later in the course.↩︎