Worksheet 3
Questions are below. My solutions will be available after the tutorials are all finished. The whole point of these worksheets is for you to use your lecture notes to figure out what to do. In tutorial, the TAs are available to guide you if you get stuck. Once you have figured out how to do this worksheet, you will be prepared to tackle the assignment that depends on it.
If you are not able to finish in an hour, I encourage you to continue later with what you were unable to finish in tutorial.
Questions are respectively on choosing rows and columns, joins, and the one-sample \(t\). There is a lot here, so you should plan to work at a good pace. You don’t have to do these three questions in order. If you want practice on a certain thing, start with that one.
Student stats
A survey of students at the University of California Davis asked respondents:
- the gender they identify as (called `Sex` in the data set)
- hours of TV they watched in the last week
- hours they spent doing non-academic things on a computer in the last week (playing games, on social media etc. This survey was taken some years ago, and so pre-dates smartphones.)
- how many hours of sleep they got last night
- where they prefer to sit in their classes (front, middle, back of classroom)
- how many alcoholic drinks they consumed in the last week (defined as US standard drinks)
- their height (inches)
- their mother’s height (inches)
- their father’s height (inches)
- the number of hours of exercise they got in the last week
- their GPA
- their area of study, classified as Liberal Arts or `NonLib` (“not arts”, meaning science, math, engineering etc).
Students were allowed to not respond to any of the items, so there are some missing values, labelled `NA` in the dataframe you are about to read in.
The data are in http://ritsokiguess.site/datafiles/UCDavis1.csv.
- Read in and display (some of) the data.
- Find the mean and standard deviation of sleep times (for all the students taken together).
- For the students that sit in each part of the classroom (separately), find the number of students and their mean sleep time. (Hint: the column is called `Seat` with a capital S; the S being uppercase matters.)
- For each of the students in this data set, display how much time they have spent watching TV and on the computer.
- Several of the columns are heights. Display all of these columns, without naming any columns.
- Display only the students who sit at the back of the classroom.
- Display all the students that either slept 5 hours or less or watched over 30 hours of TV (or both).
- For the students who are 70 or more inches tall, what are the mean heights of their mother and their father? Hint: the first time you do this, your answers will probably be missing. Why is that? Hint: to get the answers you were expecting, search for the `na.rm` option. What does that do?
Counting seabirds
Each year, bird experts associated with the Kodiak National Wildlife Refuge in Alaska count the number of seabirds of different types on the water in each of four different bays in the area. This is done by drawing (on a map) a number of straight-line “transects”, then driving a boat along each transect and counting the number and type of birds visible within a certain distance of the boat.
Variables of interest are:
- the `Year` of observation
- the `Transect` number
- `Temp`: the temperature
- `ObservCond`: visibility, from Average up to Ideal
- `Bay`: the name of the bay
- `bird`: an abbreviation for the species of bird observed
- `count`: how many of that type of bird were observed (in that year, bay, transect).
The data are in http://ritsokiguess.site/datafiles/seabird_long.csv.
- Read in and display some of the data.
- A data set containing the full bird names and their principal diet is in http://ritsokiguess.site/datafiles/bird_names.csv. Read in and display some of this data set.
- It is awkward to read the bird abbreviations in the first dataframe. Create and save a new dataframe that has the full bird names as well as the number that were observed and all the other information.
- Which three species of bird were seen the most often altogether?
- Make a graph that shows the trend in total counts of each bird species over time. Think about what would make the most appealing graph to understand the time trends.
Seniors and cellphones
A cellphone company is thinking about offering a discount to new senior customers (aged 65 and over), but first wants to know whether seniors differ in their usage of cellphone services. The company knows that, for all its current customers in a certain city, the mean length of a voice call is 9.2 minutes, and wants to know whether its current senior customers have the same or a different average length. In a recent survey, the cellphone company contacted a large number of its current customers, and asked for (among other things) the customer’s age group and when they made their last call. The length of that call was determined from the company’s records. There were 200 seniors in the survey.
The data are in http://ritsokiguess.site/datafiles/senior_phone.csv. These are only the seniors.
- Read in and display (some of) the data.
- Find the mean and standard deviation of the call lengths.
- Why might you doubt, even without looking at a graph, that the call lengths will resemble a normal distribution in shape? Explain briefly. You might find it helpful to use the fact that `pnorm(z)` works out how much of a standard normal distribution is less than the value `z`.
- Draw an appropriate graph of these data. Were your suspicions about shape confirmed?
- Explain briefly why, nonetheless, using a \(t\)-test in this situation may be reasonable.
- Test whether the mean length of all seniors’ calls in this city could be the same as the overall mean length of all calls made on the company’s network in that city, or whether it is different. What do you conclude, in the context of the data?
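
If you have not used `pnorm` before, the hint in the question about normal shape can be illustrated with a minimal sketch like the one below. The values of `z` here are arbitrary ones chosen only to show what the function returns (they are not taken from the data), and the variable names are made up for illustration.

```r
# pnorm(z) gives the proportion of a standard normal distribution below z.
# The z values below are illustrative only, not computed from the call data.

below_mean <- pnorm(0)               # half the distribution lies below 0
below_upper <- pnorm(1.96)           # about 0.975
below_lower <- pnorm(-1.96)          # about 0.025, by symmetry
within_one <- pnorm(1) - pnorm(-1)   # about 0.683: proportion between -1 and 1

print(c(below_mean, below_upper, below_lower, within_one))
```

To use this with real data, you would first turn a value of interest into a `z` by subtracting the mean and dividing by the standard deviation.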