── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.2 ✔ readr 2.1.4
✔ forcats 1.0.0 ✔ stringr 1.5.0
✔ ggplot2 3.4.2 ✔ tibble 3.2.1
✔ lubridate 1.9.2 ✔ tidyr 1.3.0
✔ purrr 1.0.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
Questions are below (there is only one this time). My solutions follow all the parts of each question; scroll down if you get stuck.
For these worksheets, you will learn the most by spending a few minutes thinking about how you would answer each question before you look at my solution. There are no grades attached to these worksheets, so feel free to guess: it makes no difference at all how wrong your initial guess is!
1 Student stats
A survey of students at the University of California Davis asked respondents:
the gender they identify as (called Sex in the data set)
hours of TV they watched in the last week
hours they spent doing non-academic things on a computer in the last week (playing games, on social media etc. This survey was taken some years ago, and so pre-dates smartphones.)
how many hours of sleep they got last night
where they prefer to sit in their classes (front, middle, back of classroom)
how many alcoholic drinks they consumed in the last week (defined as US standard drinks)
their height (inches)
their mother’s height (inches)
their father’s height (inches)
the number of hours of exercise they got in the last week
their GPA
their area of study, classified as Liberal Arts or NonLib (“not arts” meaning science, math, engineering etc).
Students were allowed to not respond to any of the items, so there are some missing values, labelled NA in the dataframe you are about to read in.
Find the mean and standard deviation of sleep times (for all the students taken together).
For the students that sit in each part of the classroom (separately), find the number of students and their mean sleep time. (Hint: the column is called Seat with a capital S; the S being uppercase matters.)
For this data set, display how much time the students have spent watching TV and on the computer. (Display only these two columns.)
Several of the columns are heights. Display all of these columns, without naming any columns.
Display only the students who sit at the back of the classroom.
Display all the students that either slept 5 hours or less or watched over 30 hours of TV (or both).
For the students who are 70 or more inches tall, what are the mean heights of their mother and their father? Hint: the first time you do this, your answers will probably be missing. Why is that? Look at my solution to see how to get the answers you were expecting.
Rows: 173 Columns: 12
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (3): Sex, Seat, class
dbl (9): TV, computer, Sleep, alcohol, Height, momheight, dadheight, exercis...
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
davis
Give the dataframe a suitable name. I called it davis, because it is a bunch of information about students at University of California-Davis.
\(\blacksquare\)
Find the mean and standard deviation of sleep times (for all the students taken together).
Solution
A summarize. Use your name for the dataframe, if it is different from mine:
davis %>% summarize(mean_sleep = mean(Sleep), sd_sleep = sd(Sleep))
You can use whatever names you like for the summaries. My recent habit is to use longer and more descriptive names for the summaries than I did in the past (like, when I wrote the lecture notes).
\(\blacksquare\)
For the students that sit in each part of the classroom (separately), find the number of students and their mean sleep time. (Hint: the column is called Seat with a capital S; the S being uppercase matters.)
Solution
This is a group-by and summarize, grouping by the categorical variable Seat first, and then computing the required summaries. We are finding something other than the counts (the mean sleep time as well), so we have to use the n() idea here:
There were two students who didn’t say where they liked to sit, so their Seat is missing.
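A minimal sketch of the group-by and summarize described above, run on a small made-up dataframe called toy. Its values and seat labels are invented for illustration and are not the survey data; with the real data you would start from davis instead:

```r
library(tidyverse)

# Hypothetical stand-in for davis: invented values, not the survey data
toy <- tibble(
  Seat  = c("Front", "Middle", "Back", "Middle", NA, "Front"),
  Sleep = c(7, 8, 6, 6.5, 5, 9)
)

# Group by seating position, then count the rows and average the sleep
# times within each group; n() counts the rows in the current group
seat_summary <- toy %>%
  group_by(Seat) %>%
  summarize(n = n(), mean_sleep = mean(Sleep))

seat_summary
```

In this sketch the missing Seat gets a group of its own, which is how students with no recorded Seat show up as a separate row in the summary.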
\(\blacksquare\)
For this data set, display how much time the students have spent watching TV and on the computer.
Solution
These two things are columns in the dataframe, and choosing columns is select:
davis %>% select(TV, computer)
It is rather annoying that the column computer has a lowercase C, and this matters as far as R is concerned.1
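To see why the case matters, here is a small sketch on a made-up dataframe called toy (invented values, standing in for davis). Asking for Computer with a capital C refers to a column that does not exist, so select stops with an error; the tryCatch is only there so that the sketch runs to the end:

```r
library(tidyverse)

# Hypothetical two-column stand-in for davis (invented values)
toy <- tibble(TV = c(10, 3), computer = c(5, 12))

# The names spelled exactly as in the dataframe work
good <- names(select(toy, TV, computer))
good

# A capital C names a column that is not there, so select() fails;
# tryCatch turns the error into a short message instead of stopping
result <- tryCatch(
  select(toy, TV, Computer),
  error = function(e) "no such column"
)
result
```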
\(\blacksquare\)
Several of the columns are heights. Display all of these columns, without naming any columns.
Solution
The first thing is to find out exactly what it is that these height column names have in common. They are Height, momheight, and dadheight. (They are the student’s height and the heights of their parents.) What they have in common is that they end with height:
davis %>% select(ends_with("height"))
Given what we were just saying about uppercase and lowercase, you might be surprised that this worked: after all, the student’s height column starts with an uppercase H, which does not match what I put in the ends_with. It turns out that select-helpers are not case-sensitive, unless you make them so:
davis %>% select(ends_with("height", ignore.case = FALSE))
The option is the rather cumbersome double negative “don’t ignore case”, i.e. pay attention to the case. This gets only the two columns whose names literally end in height. The student’s height, whose name ends with Height with a capital H, no longer matches because we are now paying attention to uppercase/lowercase.
Another way to get all three columns, if you are worried about upper and lower case, is to match on the ends of the names that are the same including case:
davis %>% select(ends_with("eight"))
Fortunately, there are no columns ending in weight here, so this gets only the columns we want.
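To see the three behaviours side by side, and what would go wrong if there were a weight column, here is a sketch on a made-up one-row dataframe called toy. The weight column and all the values are invented for illustration; davis has no such column:

```r
library(tidyverse)

# Hypothetical one-row dataframe; the weight column is invented and is not in davis
toy <- tibble(Height = 66, momheight = 64, dadheight = 70, weight = 150, Sleep = 7)

# Default behaviour ignores case, so all three height columns match
any_case <- names(select(toy, ends_with("height")))
any_case

# Paying attention to case drops Height, which has a capital H
exact_case <- names(select(toy, ends_with("height", ignore.case = FALSE)))
exact_case

# Matching on "eight" picks up the three heights, but also the invented weight column
eight <- names(select(toy, ends_with("eight")))
eight
```

So matching on eight is safe only as long as no other column name happens to end the same way.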
Or you could say that these three columns are the only ones that contain height, which is also true: