Worksheet 3

Published

September 19, 2023

Packages

library(tidyverse)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.2     ✔ readr     2.1.4
✔ forcats   1.0.0     ✔ stringr   1.5.0
✔ ggplot2   3.4.2     ✔ tibble    3.2.1
✔ lubridate 1.9.2     ✔ tidyr     1.3.0
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

Questions are below (there is only one this time). My solutions are below all the question parts for a question; scroll down if you get stuck.

For these worksheets, you will learn the most by spending a few minutes thinking about how you would answer each question before you look at my solution. There are no grades attached to these worksheets, so feel free to guess: it makes no difference at all how wrong your initial guess is!

1 Student stats

A survey of students at the University of California Davis asked respondents:

the gender they identify as (called Sex in the data set)
hours of TV they watched in the last week
hours they spent doing non-academic things on a computer in the last week (playing games, on social media etc. This survey was taken some years ago, and so pre-dates smartphones.)
how many hours of sleep they got last night
where they prefer to sit in their classes (front, middle, back of classroom)
how many alcoholic drinks they consumed in the last week (defined as US standard drinks)
their height (inches)
their mother’s height (inches)
their father’s height (inches)
the number of hours of exercise they got in the last week
their GPA
their area of study, classified as Liberal Arts or NonLib (“not arts” meaning science, math, engineering etc).

Students were allowed to not respond to any of the items, so there are some missing values, labelled NA in the dataframe you are about to read in.

The data are in http://ritsokiguess.site/datafiles/UCDavis1.csv.

Read in and display (some of) the data.
Find the mean and standard deviation of sleep times (for all the students taken together).
For the students that sit in each part of the classroom (separately), find the number of students and their mean sleep time. (Hint: the column is called Seat with a capital S; the S being uppercase matters.)
For this data set, display how much time the students have spent watching TV and on the computer. (Display only these two columns.)
Several of the columns are heights. Display all of these columns, without naming any columns.
Display only the students who sit at the back of the classroom.
Display all the students that either slept 5 hours or less or watched over 30 hours of TV (or both).
For the students who are 70 or more inches tall, what are the mean heights of their mother and their father? Hint: the first time you do this, your answers will probably be missing. Why is that? Look at my solution to see how to get the answers you were expecting.

My solutions follow:

Read in and display (some of) the data.

Solution

As usual:

my_url <- "http://ritsokiguess.site/datafiles/UCDavis1.csv"
davis <- read_csv(my_url)

Rows: 173 Columns: 12
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (3): Sex, Seat, class
dbl (9): TV, computer, Sleep, alcohol, Height, momheight, dadheight, exercis...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

davis

Give the dataframe a suitable name. I called it davis, because it is a bunch of information about students at University of California-Davis.

\(\blacksquare\)

Find the mean and standard deviation of sleep times (for all the students taken together).

Solution

A summarize. Use your name for the dataframe, if it is different from mine:

davis %>% summarize(mean_sleep = mean(Sleep), sd_sleep = sd(Sleep))

You can use whatever names you like for the summaries. My recent habit is to use longer and more descriptive names for the summaries than I did in the past (like, when I wrote the lecture notes).

\(\blacksquare\)

For the students that sit in each part of the classroom (separately), find the number of students and their mean sleep time. (Hint: the column is called Seat with a capital S; the S being uppercase matters.)

Solution

This is a group-by and summarize, grouping by the categorical variable Seat first, and then computing the required summaries. We want something in addition to the counts (the mean sleep time), so we have to use the n() idea here:

# summary(davis)
davis %>% 
  group_by(Seat) %>% 
  summarize(n_student = n(), mean_sleep = mean(Sleep))

There were two students who didn’t say where they liked to sit, so their Seat is missing.
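To see why n() is needed here rather than count, it may help to compare the two on a small made-up dataframe. The values below are invented for illustration and are not taken from the survey; the name toy is mine.

```r
library(dplyr)

# Hypothetical toy data, not the real survey values
toy <- tibble(
  Seat  = c("Front", "Front", "Back", "Middle", "Back"),
  Sleep = c(6, 8, 7, 5, 9)
)

# count() gives the group sizes and nothing else
counts <- toy %>% count(Seat)
counts

# n() inside summarize() gives the same counts next to other summaries
both <- toy %>%
  group_by(Seat) %>%
  summarize(n_student = n(), mean_sleep = mean(Sleep))
both
```

The counts agree, but only the second version has room for the mean sleep time as well.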

\(\blacksquare\)

For this data set, display how much time the students have spent watching TV and on the computer. (Display only these two columns.)

Solution

These two things are columns in the dataframe, and choosing columns is select:

davis %>% select(TV, computer)

It is rather annoying that the column computer has a lowercase C, and this matters as far as R is concerned.¹

\(\blacksquare\)

Several of the columns are heights. Display all of these columns, without naming any columns.

Solution

The first thing is to find out exactly what it is that these height column names have in common. They are Height, momheight, and dadheight. (They are the student’s height and the heights of their parents.) What they have in common is that they end with height:

davis %>% select(ends_with("height"))

Given what we were just saying about uppercase and lowercase, you might be surprised that this worked: after all, the one that is the student’s height starts with an uppercase H, which does not match what I put in the ends_with. It turns out that select-helpers are not case-sensitive, unless you make them so:

davis %>% select(ends_with("height", ignore.case = FALSE))

The option is the rather cumbersome double negative “don’t ignore case”, i.e. pay attention to the case. This gets only the two columns whose names literally end in height. The student’s own height, whose name ends with Height with a capital H, no longer matches because we are now paying attention to uppercase/lowercase.

Another way to get all three columns, if you are worried about upper and lower case, is to match on the ends of the names that are the same including case:

davis %>% select(ends_with("eight"))

Fortunately, there are no columns ending in weight here, so this gets only the columns we want.

Or you could say that these three columns are the only ones that contain height, which is also true:

davis %>% select(contains("height"))

or if you really want to grapple with regular expressions, even this:

davis %>% select(matches("height$"))

Inside matches, my regular expression searches for columns that match with the letters height at the end of the column name, so that eg height_in_cm wouldn’t match.²
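If you want to check how the select-helpers differ, here is a sketch on a one-row made-up dataframe. The column height_in_cm is hypothetical (the real data has no such column), and all the values are invented.

```r
library(dplyr)

# Hypothetical one-row dataframe; height_in_cm is not in the real data
toy <- tibble(
  Height = 70, momheight = 65, dadheight = 71,
  height_in_cm = 177.8, GPA = 3.2
)

# names() shows which columns each helper picked out
ends_any  <- toy %>% select(ends_with("height")) %>% names()
ends_case <- toy %>% select(ends_with("height", ignore.case = FALSE)) %>% names()
has       <- toy %>% select(contains("height")) %>% names()
regex     <- toy %>% select(matches("height$")) %>% names()

ends_any   # Height, momheight, dadheight
ends_case  # momheight, dadheight
has        # Height, momheight, dadheight, height_in_cm
regex      # Height, momheight, dadheight
```

Only contains picks up height_in_cm, because that name has height in it but not at the end.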

\(\blacksquare\)

Display only the students who sit at the back of the classroom.

Solution

This is choosing rows, specifically the rows for which Seat is Back, so it needs filter:

davis %>% filter(Seat == "Back")

Don’t forget the two equals signs, which is how R tests that something is equal to something else: “give me all the rows for which it is true that Seat is equal to Back”.

Confusing aside: Also, Back is a literal piece of text, so it has to be in quotes; filter(Seat == Back) would mean “give me all the rows for which the column Seat is equal to whatever is stored in the variable Back”. Can you make sense of this?

Back <- "Front"
davis %>% filter(Seat == Back)

By the way, because this is so confusing, don’t do something like what I just did! Your work colleagues will hate it, and when you come back to your code in six months, you will hate yourself!

End of aside.

\(\blacksquare\)

Display all the students that either slept 5 hours or less or watched over 30 hours of TV (or both).

Solution

This is either/or, so needs |:

davis %>% filter(Sleep <= 5 | TV > 30)

To be precise, “5 hours or less” is less than or equal, and “over” is strictly greater than.

To convince yourself that you have the right ones, scroll through your results and say why each of the students is there. My first ten students are all there because they slept 5 hours or less (the 8th one also watched over 30 hours of TV). In the next ten, the first student got more than 5 hours of sleep but is there because they (claimed to have) watched 100 hours of TV. Students 24 and 25 got lots of sleep but also watched lots of TV. And so on.

\(\blacksquare\)

For the students who are 70 or more inches tall, what are the mean heights of their mother and their father? Hint: the first time you do this, your answers will probably be missing. Why is that? Look at my solution to see how to get the answers you were expecting.

Solution

In this kind of problem, where we want summaries for only some of the observations, the first step is to select the observations we want, and the second step is to summarize the observations we selected:

davis %>% filter(Height >= 70) %>% 
  summarize(mom_mean = mean(momheight), dad_mean = mean(dadheight))

OK, so those were both missing. To find out why, step back to the filter (run it without the summarize):

davis %>% filter(Height >= 70)

Page down almost to the end to see that the 40th student didn’t give an answer for their mother’s and father’s heights. As far as R is concerned, the mean of a set of numbers including a missing value is itself missing, on the basis that the missing value could be anything.
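You can see the same behaviour on a tiny made-up example (these three numbers are invented, not taken from the data):

```r
# Three made-up heights, first complete and then with one missing
complete     <- c(62, 64, 66)
with_missing <- c(62, NA, 66)

mean(complete)      # 64
mean(with_missing)  # NA: the missing value could be anything, so the mean could be too
```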

Here, it seems reasonable to calculate the means of the non-missing parental heights.³ There are a couple of ways to do that, which you probably haven’t seen before. The first keeps the rows with missing values in the dataframe, but tells mean to remove the missing values before calculating:⁴

davis %>% filter(Height >= 70) %>% 
  summarize(mom_mean = mean(momheight, na.rm = TRUE), dad_mean = mean(dadheight, na.rm = TRUE))

The second way removes the missing momheights and dadheights first before trying to calculate the means:

davis %>% filter(Height >= 70) %>% 
  drop_na(momheight, dadheight) %>% 
  summarize(mom_mean = mean(momheight), dad_mean = mean(dadheight))

Same answer. If you run the first two lines of this (and not the third one), you’ll see that there are now only 41 students chosen, and the one with the missing momheight and dadheight is no longer there. So when you run mean, you will not run into any missing values.

\(\blacksquare\)

Extras (for reading later, unless you finished early):

(c):

There seems to be a tendency for students that sit nearer the front of the class to sleep less. This is a bit hard to read from this table, because the values of Seat come out in alphabetical order, rather than a more easily interpretable order like “front, middle, back”.⁵

Having seen this, a graph might give us a hint about whether this is a real trend or more likely just chance:

ggplot(davis, aes(x = Seat, y = Sleep)) + geom_boxplot()

The median sleep times (for the students with non-missing Seat) seem to be exactly equal, and there is a lot of variability, so the trend we saw is probably just chance.
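If the alphabetical order of Seat bothers you, one possible fix is to turn Seat into a factor with the levels in the order you want. This is a sketch on made-up data; I am assuming the seat labels are Front, Middle and Back (Middle is a guess at the exact spelling), and the sleep values are invented.

```r
library(dplyr)

# Hypothetical toy data, not the real survey values
toy <- tibble(
  Seat  = c("Back", "Front", "Middle", "Front", "Back"),
  Sleep = c(7, 6, 8, 5, 9)
)

# Tell R the order we want; the summary rows then come out in that order
ordered_means <- toy %>%
  mutate(Seat = factor(Seat, levels = c("Front", "Middle", "Back"))) %>%
  group_by(Seat) %>%
  summarize(mean_sleep = mean(Sleep))
ordered_means
```

A boxplot drawn from a dataframe treated this way would also show the boxes in the order front, middle, back.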

(h):

Extra 1: when you feed drop_na more than one column, it removes any rows that contain missing values in any of the columns. For example, consider this dataframe:
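A small dataframe d of the kind described here can be built as follows. This is a sketch: the column x and every value other than the 5 and the 7 are made up for illustration, and only the pattern of missing values matters.

```r
library(dplyr)
library(tidyr)

# Hypothetical dataframe: row 2 is missing y, row 3 is missing z
d <- tribble(
  ~x, ~y, ~z,
   1,  2,  3,
   4, NA,  5,
   6,  7, NA
)
d
```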

What does this do?

d %>% drop_na(y, z)

There is only the first row left, because the second row has a missing value for y and the third has a missing value for z. The values 5 for z and 7 for y, even though they are good data values, are not included because the rows they were in had missing values for something else. The moral of this story is that if you use drop_na on several columns, you stand to lose quite a lot of your data. The default for drop_na is to drop rows with any missing values at all. The original davis dataframe had 173 rows, but if we drop all the rows with any missing values:

davis %>% drop_na()

we are down to 150 rows, having lost almost 15% of our data. Some people prefer to “impute” missing data, which is to replace missing values with an estimate based on the values of other variables. For example, a student with a missing Height whose parents are both tall is probably tall themselves.

Extra 2: on that note, you would expect taller students to have taller parents. How do the values we calculated above compare with the mean heights of all the parents?

davis %>% 
  summarize(mom_mean = mean(momheight, na.rm = TRUE), dad_mean = mean(dadheight, na.rm = TRUE))

An inch or two less.

Extra 3: You’ll remember group-by and summarize from numerical summaries. Is there a way to use that here? Sort of:

davis %>% 
  group_by(Height >= 70) %>% 
  summarize(n = n(), mom_mean = mean(momheight, na.rm = TRUE), dad_mean = mean(dadheight, na.rm = TRUE))

The answers in the TRUE line are the ones we got before. We are used to using group_by with a categorical column, which we don’t have for this problem, but in fact you can use group_by with anything that makes a categorical, including something that evaluates to TRUE or FALSE. The expression Height >= 70 evaluates to true or false or missing (there were two students who didn’t answer how tall they were). The first line of the result gives the mean parental heights for the students who were less than 70 inches tall (most of them), and the last line gives the mean parental heights for the students who didn’t give their own heights.

Note that some of the parental heights were missing, even for students who gave their own heights, so I had to use the na.rm thing again (or I could have used drop_na as before, before calculating the means).
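Here is the same idea on a made-up dataframe that is small enough to check by hand. The values are invented, and naming the grouping expression (tall = ...) is my own choice to get a tidier column name; it is not needed for the grouping to work.

```r
library(dplyr)

# Hypothetical toy data: one missing Height, one missing momheight
toy <- tibble(
  Height    = c(72, 64, NA, 70, 66),
  momheight = c(66, 62, 64, NA, 63)
)

# Group by a TRUE/FALSE/NA expression, then summarize within each group
by_tall <- toy %>%
  group_by(tall = Height >= 70) %>%
  summarize(n = n(), mom_mean = mean(momheight, na.rm = TRUE))
by_tall
```

There are three rows in the result: FALSE, TRUE, and NA for the student who didn’t give a height.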

Footnotes

You could have another column called Computer in here as well, and R would treat them as different. That is what it means to say that in R, case matters.↩︎
Though height_in_cm, if we had such a column, would match contains.↩︎
You might want not to do that if being missing is itself informative about heights. The sort of thing I mean is if students with short parents are more likely not to report their parents’ heights. In that case, a mean of parental heights would be biased because you have too few small ones.↩︎
rm is shorthand for “remove”, and NA is R’s notation for “missing”. Thus the function option na.rm = TRUE means “it is true that you remove the missing values” before calculating the mean. rm comes from the Unix operating system, which lives on as Linux, and is a command for removing files.↩︎
R has no way to know what a logical order is for these data, unless we tell it.↩︎