Worksheet 3

Published

September 22, 2024

Questions are below. My solutions will be available after the tutorials are all finished. The whole point of these worksheets is for you to use your lecture notes to figure out what to do. In tutorial, the TAs are available to guide you if you get stuck. Once you have figured out how to do this worksheet, you will be prepared to tackle the assignment that depends on it.

If you are not able to finish in an hour, I encourage you to continue later with what you were unable to finish in tutorial.

Questions are respectively on choosing rows and columns, joins, and the one-sample \(t\). There is a lot here, so you should plan to work at a good pace. You don’t have to do these three questions in order. If you want practice on a certain thing, start with that one.

Student stats

A survey of students at the University of California Davis asked respondents:

  • the gender they identify as (called Sex in the data set)
  • hours of TV they watched in the last week
  • hours they spent doing non-academic things on a computer in the last week (playing games, on social media etc. This survey was taken some years ago, and so pre-dates smartphones.)
  • how many hours of sleep they got last night
  • where they prefer to sit in their classes (front, middle, back of classroom)
  • how many alcoholic drinks they consumed in the last week (defined as US standard drinks)
  • their height (inches)
  • their mother’s height (inches)
  • their father’s height (inches)
  • the number of hours of exercise they got in the last week
  • their GPA
  • their area of study, classified as Liberal Arts or NonLib (“non-liberal arts”, meaning science, math, engineering, etc.).

Students were allowed to not respond to any of the items, so there are some missing values, labelled NA in the dataframe you are about to read in.

The data are in http://ritsokiguess.site/datafiles/UCDavis1.csv.

  1. Read in and display (some of) the data.

Solution

As usual:

my_url <- "http://ritsokiguess.site/datafiles/UCDavis1.csv"
davis <- read_csv(my_url)
Rows: 173 Columns: 12
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (3): Sex, Seat, class
dbl (9): TV, computer, Sleep, alcohol, Height, momheight, dadheight, exercis...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
davis

Give the dataframe a suitable name. I called it davis, because it is a bunch of information about students at University of California-Davis.

\(\blacksquare\)

  2. Find the mean and standard deviation of sleep times (for all the students taken together).

Solution

A summarize. Use your name for the dataframe, if it is different from mine:

davis %>% summarize(mean_sleep = mean(Sleep), sd_sleep = sd(Sleep))

You can use whatever names you like for the summaries. My recent habit is to use longer and more descriptive names for the summaries than I did in the past (like, when I wrote the lecture notes).

\(\blacksquare\)

  3. For the students that sit in each part of the classroom (separately), find the number of students and their mean sleep time. (Hint: the column is called Seat with a capital S, and the capital S matters.)

Solution

This is a group-by and summarize, grouping by the categorical variable Seat first, and then computing the required summaries. We are finding something other than just the counts (the mean sleep time as well), so we have to use the n() idea here:

# summary(davis)
davis %>% 
  group_by(Seat) %>% 
  summarize(n_student = n(), mean_sleep = mean(Sleep))

There were two students who didn’t say where they liked to sit, so their Seat is missing.

Extra:

There seems to be a tendency for students that sit nearer the front of the class to sleep less. This is a bit hard to read from this table, because the values of Seat come out in alphabetical order, rather than a more easily interpretable order like “front, middle, back”.1
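If the alphabetical order bothers you, one fix (a sketch, not needed for this question) is to turn Seat into a factor with its levels in a sensible order, using fct_relevel from forcats (which loads with the tidyverse). This assumes the seat labels are spelled exactly Front, Middle, and Back:

davis %>% 
  mutate(Seat = fct_relevel(Seat, "Front", "Middle", "Back")) %>% 
  group_by(Seat) %>% 
  summarize(n_student = n(), mean_sleep = mean(Sleep))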

Having seen this, a graph might give us a hint about whether this is a real trend or more likely just chance:

ggplot(davis, aes(x = Seat, y = Sleep)) + geom_boxplot()

The median sleep times (for the students with non-missing Seat) seem to be exactly equal, and there is a lot of variability, so the trend we saw is probably just chance.

\(\blacksquare\)

  4. For each of the students in this data set, display how much time they have spent watching TV and on the computer.

Solution

These two things are columns in the dataframe, and choosing columns is select:

davis %>% select(TV, computer)

It is rather annoying that the column computer has a lowercase C, and this matters as far as R is concerned.2

\(\blacksquare\)

  5. Several of the columns are heights. Display all of these columns, without naming any columns.

Solution

The first thing is to find out exactly what it is that these height column names have in common. They are Height, momheight, and dadheight. (They are the student’s height and the heights of their parents.) What they have in common is that they end with height:

davis %>% select(ends_with("height"))

Given what we were just saying about uppercase and lowercase, you might be surprised that this worked: after all, the one that is the student’s height starts with an uppercase H, which does not match what I put in the ends_with. It turns out that select-helpers are not case-sensitive, unless you make them so:

davis %>% select(ends_with("height", ignore.case = FALSE))

The option is the rather cumbersome double negative “don’t ignore case”, ie. pay attention to the case. This gets only the two columns whose names literally end in height, not the student’s height whose name is (ends with) Height with a capital H, and this no longer matches because we are paying attention to uppercase/lowercase.

Another way to get all three columns, if you are worried about upper and lower case, is to match on the ends of the names that are the same including case:

davis %>% select(ends_with("eight"))

Fortunately, there are no columns ending in weight here, so this gets only the columns we want.
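If you want to reassure yourself of that, one quick (base R) way is to look at all the column names and scan them for anything ending in weight:

names(davis)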

Or you could say that these three columns are the only ones that contain height, which is also true:

davis %>% select(contains("height"))

or if you really want to grapple with regular expressions, even this:

davis %>% select(matches("height$"))

Inside matches, my regular expression searches for columns that match with the letters height at the end of the column name, so that eg height_in_cm wouldn’t match.3

\(\blacksquare\)

  6. Display only the students who sit at the back of the classroom.

Solution

This is choosing rows, specifically the rows for which Seat is Back, so it needs filter:

davis %>% filter(Seat == "Back")

Don’t forget the two equals signs, which is how R tests that something is equal to something else: “give me all the rows for which it is true that Seat is equal to Back”.

Confusing aside: Also, Back is a literal piece of text, so it has to be in quotes; filter(Seat == Back) would mean “give me all the rows for which the column Seat is equal to whatever is stored in the variable Back”. Can you make sense of this?

Back <- "Front"
davis %>% filter(Seat == Back) 

By the way, because this is so confusing, don’t do something like what I just did! Your work colleagues will hate it, and when you come back to your code in six months, you will hate yourself!

End of confusing aside.

\(\blacksquare\)

  7. Display all the students that either slept 5 hours or less or watched over 30 hours of TV (or both).

Solution

This is either/or, so needs |:

davis %>% filter(Sleep <= 5 | TV > 30)

To be precise, “5 hours or less” is less than or equal, and “over” is strictly greater than.

To convince yourself that you have the right ones, scroll through your results and say why each of the students is there. My first ten students are all there because they slept 5 hours or less (the 8th one also watched over 30 hours of TV). In the next ten, the first student got more than 5 hours of sleep but is there because they (claimed to have) watched 100 hours of TV. Students 24 and 25 got lots of sleep but also watched lots of TV. And so on.
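If you would rather not scroll, here is a sketch of one way to see at a glance which condition each chosen student satisfies (the names low_sleep and much_tv are just ones I made up):

davis %>% filter(Sleep <= 5 | TV > 30) %>% 
  select(Sleep, TV) %>% 
  mutate(low_sleep = (Sleep <= 5), much_tv = (TV > 30))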

\(\blacksquare\)

  8. For the students who are 70 or more inches tall, what are the mean heights of their mother and their father? Hint: the first time you do this, your answers will probably be missing. Why is that? Hint: to get the answers you were expecting, search for the na.rm option. What does that do?

Solution

In this kind of problem, where we want summaries for only some of the observations, the first step is to select the observations we want, and the second step is to summarize the observations we selected:

davis %>% filter(Height >= 70) %>% 
  summarize(mom_mean = mean(momheight), dad_mean = mean(dadheight))

OK, so those were both missing. To find out why, step back to the filter (run it without the summarize):

davis %>% filter(Height >= 70) 

Page down almost to the end to see that the 40th student didn’t give an answer for their mother’s and father’s heights. As far as R is concerned, the mean of a set of numbers including a missing value is itself missing, on the basis that the missing value could be anything.
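You can see this behaviour with a tiny made-up example:

mean(c(1, 2, NA))  # NA: the missing value could be anything, so the mean is unknown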

Here, it seems reasonable to calculate the means of the non-missing parental heights.4 There are a couple of ways to do that, which you probably haven’t seen before. The first includes the missing values, but removes them when calculating the mean:5

davis %>% filter(Height >= 70) %>% 
  summarize(mom_mean = mean(momheight, na.rm = TRUE), dad_mean = mean(dadheight, na.rm = TRUE))

When searching for an R function or option, it usually helps to search with r first. When I searched for “r na.rm”, I got several hits, any of which would be good. For example, https://www.statology.org/na-rm/ says clearly how to calculate the mean without including any missing values, which is exactly what we want to do.

A second way removes the missing momheights and dadheights first before trying to calculate the means:

davis %>% filter(Height >= 70) %>% 
  drop_na(momheight, dadheight) %>% 
  summarize(mom_mean = mean(momheight), dad_mean = mean(dadheight))

To search for this kind of approach, you can try something like “r omit rows with missing values”. You’ll see a lot of discussion of the old-fashioned na.omit in the results, but look for answers that use drop_na, which is the “tidy” way to do it (and the one you want).

The answer, both ways, is the same. If you run the first two lines of this (and not the third one), you’ll see that there are now only 41 students chosen, and the one with the missing momheight and dadheight is no longer there. So when you run mean, you will not run into any missing values.

Extras:

Extra 1: when you feed drop_na more than one column, it removes any rows that contain missing values in any of the columns. For example, consider this dataframe:

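(The dataframe d was created beforehand; its construction is not shown above. Here is a sketch of one consistent with the description that follows, built with tribble; the values in the first row are invented for illustration.)

d <- tribble(
  ~y, ~z,
   1,  2,  # a complete row
  NA,  5,  # y is missing here
   7, NA   # z is missing here
)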
d

What does this do?

d %>% drop_na(y, z)

There is only the first row left, because the second row has a missing value for y and the third has a missing value for z. The values 5 for z and 7 for y, even though they are good data values, are not included because the rows they were in had missing values for something else. The moral of this story is that if you use drop_na on several columns, you stand to lose quite a lot of your data. The default for drop_na is to drop rows with any missing values at all. The original davis dataframe had 173 rows, but if we drop all the rows with any missing values:

davis %>% drop_na()

we are down to 150 rows, having lost almost 15% of our data. Some people prefer to “impute” missing data, which is to replace missing values with an estimate based on the values of other variables. For example, a student with a missing Height whose parents are both tall is probably tall themselves.

Extra 2: on that note, you would expect taller students to have taller parents. How do the values we calculated above compare with the mean heights of all the parents?

davis %>% 
  summarize(mom_mean = mean(momheight, na.rm = TRUE), dad_mean = mean(dadheight, na.rm = TRUE))

An inch or two less.

Extra 3: You’ll remember group-by and summarize from numerical summaries. Is there a way to use that here? Sort of:

davis %>% 
  group_by(Height >= 70) %>% 
  summarize(n = n(), mom_mean = mean(momheight, na.rm = TRUE), dad_mean = mean(dadheight, na.rm = TRUE))

The answers in the TRUE line are the ones we got before. We are used to using group_by with a categorical column, which we don’t have for this problem, but in fact you can use group_by with anything that makes a categorical variable, including something that evaluates to TRUE or FALSE. The expression Height >= 70 evaluates to true or false or missing (there were two students who didn’t answer how tall they were). The first line of the result gives the mean parental heights for the students who were less than 70 inches tall (most of them), and the last line gives the mean parental heights for the students who didn’t give their own heights.

Note that some of the parental heights were missing, even for students who gave their own heights, so I had to use the na.rm thing again (or I could have used drop_na as before, before calculating the means).

\(\blacksquare\)

Counting seabirds

Each year, bird experts associated with the Kodiak National Wildlife Refuge in Alaska count the number of seabirds of different types on the water in each of four different bays in the area. This is done by drawing (on a map) a number of straight-line “transects”, then driving a boat along each transect and counting the number and type of birds visible within a certain distance of the boat.

Variables of interest are:

  • the Year of observation
  • the Transect number
  • Temp: the temperature
  • ObservCond: visibility, from Average up to Ideal
  • Bay: the name of the bay
  • bird: an abbreviation for the species of bird observed
  • count: how many of that type of bird were observed (in that year, bay, transect).

The data are in http://ritsokiguess.site/datafiles/seabird_long.csv.

  1. Read in and display some of the data.
my_url <- "http://ritsokiguess.site/datafiles/seabird_long.csv"
seabird <- read_csv(my_url)
Rows: 56895 Columns: 9
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (4): ObservCond, Bay, ObservCondFactor3, bird
dbl (5): Year, Site, Transect, Temp, count

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
seabird

There are over 56,000 observations, containing the variables named above (plus some others, which we ignore). The pager displays a maximum of 1000 pages of 10 rows each, so you might think there are exactly 10,000 observations, but that is only how many are displayed. The actual number of rows is shown at the top of the output.
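If you want R to confirm the number of rows directly, a quick base-R check (not something the question asks for) is:

nrow(seabird)  # 56895, matching the Rows message from read_csv above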

Extra: some more details:

  • Transect reference. This talks about measuring plant cover on land, but the transects in our data are on water (so you cannot use a measuring tape!). The idea is that by always using the same transects, measurements are consistent from year to year, and by having a set of transects that cover the study area, you should be observing most or all of what you are trying to observe.
  • The original paper that the data came from, with a lot more detail about how the observations were made. The aim of this study was to compare seabird numbers from one year to the next, so using the same transects every year was key. It was not important to estimate the number of seabirds in the entire wildlife refuge, but instead to see how the counts in the places you looked changed over time.
  2. A data set containing the full bird names and their principal diet is in http://ritsokiguess.site/datafiles/bird_names.csv. Read in and display some of this data set.

The same again, really:

my_url <- "http://ritsokiguess.site/datafiles/bird_names.csv"
bird_names <- read_csv(my_url)
Rows: 15 Columns: 3
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (3): bird, bird_name, diet

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
bird_names

There are 15 different bird species observed. Some of them, like the Loon, have subspecies, but they are difficult to tell apart (at least to an observer on a boat looking through binoculars).

  3. It is awkward to read the bird abbreviations in the first dataframe. Create and save a new dataframe that has the full bird names as well as the number that were observed and all the other information.

This is using the small dataframe of bird names as a lookup table (as we did with the product codes and product names in lecture). The strategy is to use the lookup table second in a left-join. Note that both dataframes have a column called bird with the bird abbreviations, and that is the only column they have in common. Hence there is no need for a join_by:

seabird %>% left_join(bird_names) -> birds_all
Joining with `by = join_by(bird)`
birds_all

Scroll across to see that you have a column containing the bird’s full name (in bird_name) and also its diet.
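If you prefer not to scroll, a quick check (a sketch, not required) is to pull out just the columns involved in the join, plus the count:

birds_all %>% select(bird, bird_name, diet, count)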

If you want to be clear about how the join is happening, you can specify the column to join by explicitly (note that if you don’t specify what you are joining by, left_join will tell you what it joined by, and you can make sure that this is correct):

seabird %>% left_join(bird_names, join_by("bird")) 

but if you do this on an assignment, you ought to say why you have done it this way when you could have omitted the join_by. Wanting to be clear about what is happening is a good reason.

  4. Which three species of bird were seen the most often altogether?

This is a group-by and summarize. Since we’re summing up over years, bays, and transects, we don’t need to specify any of them in the group-by: just group by species, and summarize by summing the counts. Use your dataframe with the full bird names in it. The last line shows the three rows of the result with the highest total count:

birds_all %>% 
  group_by(bird_name) %>% 
  summarize(total_count = sum(count)) %>% 
  slice_max(total_count, n = 3)

Common Murre, by a long way, then Long-tailed Duck and Black Scoter.

Another way to do this is to arrange the total counts in descending order, and pick off the top three.
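A sketch of that approach, which should give the same three birds:

birds_all %>% 
  group_by(bird_name) %>% 
  summarize(total_count = sum(count)) %>% 
  arrange(desc(total_count)) %>% 
  slice(1:3)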

  5. Make a graph that shows the trend in total counts of each bird species over time. Think about what would make the most appealing graph to understand the time trends.

The first step is to work out the totals over species and year, which will make the raw material for our graph:

birds_all %>% group_by(bird_name, Year) %>% 
  summarize(total = sum(count)) -> d
`summarise()` has grouped output by 'bird_name'. You can override using the
`.groups` argument.
d

So now we have total and year (quantitative) and species (categorical). This suggests a scatterplot with the species indicated by colour. Joining the points by lines often works well on a time plot:

ggplot(d, aes(x = Year, y = total, colour = bird_name)) + geom_point() + geom_line()
Warning: Removed 3 rows containing missing values or values outside the scale range
(`geom_point()`).
Warning: Removed 3 rows containing missing values or values outside the scale range
(`geom_line()`).

The problem here is that some of the birds are much more commonly seen than others. That is the Common Murre at the top, and pretty much everything else mixed up at the bottom, which makes it hard to see what is going on. Also, it’s hard to tell that many colours apart.

Two ideas, then:

  • instead of using colours for the species, how about using facets?
  • because some birds are much more commonly seen than others, give each facet its own scale:
ggplot(d, aes(x = Year, y = total)) + geom_point() + geom_line() +
  facet_wrap(~ bird_name, scales = "free")
Warning: Removed 3 rows containing missing values or values outside the scale range
(`geom_point()`).

and now you can see what is going on. There seem to be a few birds reaching a peak around 2000 or just before, and then dropping off, or maybe a bit of up-and-down cyclic behaviour, but there aren’t very many common themes. It’s probably best to talk about each bird species separately.

Seniors and cellphones

A cellphone company is thinking about offering a discount to new senior customers (aged 65 and over), but first wants to know whether seniors differ in their usage of cellphone services. The company knows that, for all its current customers in a certain city, the mean length of a voice call is 9.2 minutes, and wants to know whether its current senior customers have the same or a different average length. In a recent survey, the cellphone company contacted a large number of its current customers, and asked for (among other things) the customer’s age group and when they made their last call. The length of that call was determined from the company’s records. There were 200 seniors in the survey.

The data are in http://ritsokiguess.site/datafiles/senior_phone.csv. These are only the seniors.

  1. Read in and display (some of) the data.
my_url <- "http://ritsokiguess.site/datafiles/senior_phone.csv"
calls <- read_csv(my_url)
Rows: 200 Columns: 1
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
dbl (1): call_length

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
calls

One column, called call_length, containing the call lengths, evidently in whole numbers of minutes, and 200 rows, one for each senior customer.

Extra 1: if you were employed by the cellphone company, you would probably receive a spreadsheet containing lots of columns, maybe even the whole survey, with each customer’s phone number (or account number), and a lot of other columns containing responses to other questions on the survey. One of the columns would be age group, and then you would have to extract only the rows that were seniors (using filter) before you could get to work.

Extra 2: back in the days of phone plans that weren’t “unlimited”, you got charged by the minute, and phone companies used to round the number of minutes up, so that you would see what you see here: a call to a store to see whether it is open, that lasted about 10 seconds, would be counted as one minute for your plan.

  2. Find the mean and standard deviation of the call lengths.

This is summarize, without even the need for a group_by:

calls %>% summarize(mean_length = mean(call_length),
                    sd_length = sd(call_length))

The mean is 8.3 minutes, and the standard deviation is 9.6 minutes. (The data are whole numbers, so give your answers with one or at most 2 decimals. The reader won’t be able to use these numbers as is: too many decimals. So you should give the answers and round them for your reader, rather than making the reader do extra work.)
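If you would rather have R do the rounding for you, one way (a sketch; rounding by hand when you write up your answer is just as good) is to wrap the summaries in round:

calls %>% summarize(mean_length = round(mean(call_length), 1),
                    sd_length = round(sd(call_length), 1))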

It is good practice to give the summaries meaningful names, particularly if there are several summaries, or of several different variables. It matters a little less here, since there is only one variable, but it’s still a good idea.

  3. Why might you doubt, even without looking at a graph, that the call lengths will resemble a normal distribution in shape? Explain briefly. You might find it helpful to use the fact that pnorm(z) works out how much of a standard normal distribution is less than the value z.

The first thing to consider is that these are lengths of time, so that they cannot be less than zero. With that in mind, the standard deviation looks awfully big.

Distributions with a limit (here 0) are often skewed away from that limit (here, to the right). This is particularly likely if there are values near the limit — here there are, because if you scan through the data, you’ll see a lot of very short calls. The large standard deviation comes from a long tail the other way: the smallish number of very long calls.

In summary, make two points:

  • there is a lower limit of 0 for call length
  • the reason for the SD being large is a long tail the other way: that is, some of the call lengths must be very large.

Another way to go is to think about what a normal distribution with this mean and SD would look like, and make the case that this makes no sense for these data. This is where you can use my hint about pnorm.

For example, if you think of the 68-95-99.7 rule for normal distributions, about 95% of the distribution should be between these two values (using R as a calculator):6

8.32 - 2 * 9.57
[1] -10.82
8.32 + 2 * 9.57
[1] 27.46

But this includes a lot of values below zero, which are impossible for lengths of phone calls.

Or, taking a similar angle, work out how much of a normal distribution with this mean and SD would be less than zero. My technique here is to once again use R as a calculator, but to do it as you would by hand (STAB22/STAB52 style):

z <- (0 - 8.32) / 9.57
z
[1] -0.8693835
pnorm(z)
[1] 0.1923187

Almost 20% of this distribution is below zero, which, as we said, makes no sense for lengths of phone calls. So this is another way to say that normality makes no sense here.

Extra: a code note — pnorm does the same thing that those normal tables at the back of the textbook do (the textbook for your first stats course): it turns a z into a “probability of less than”. In fact, pnorm will work out probability of less than for any normal distribution; you don’t have to standardize it first. To use this, add the mean and SD as second and third inputs, thus:

pnorm(0, 8.32, 9.57)
[1] 0.1923187

to get the same answer. The reason for the standardizing business in your first course is that making normal tables for every possible normal distribution would use an awful lot of paper, but it’s not too hard to turn a value from any normal distribution into one from a standard normal distribution, and so having a table of the standard normal is enough. With a handy computer, though, it can just calculate the value you need anytime.

  4. Draw an appropriate graph of these data. Were your suspicions about shape confirmed?

One quantitative variable, so a histogram:

ggplot(calls, aes(x = call_length)) + geom_histogram(bins = 8)

Absolutely. That is certainly skewed to the right.

A one-sample boxplot is also possible:

ggplot(calls, aes(x = 1, y = call_length)) + geom_boxplot()

That is a very right-skewed distribution. (I like that better as a description compared with “there are a lot of outliers”, because the large number of unusual values7 added to a long upper whisker points to this being a feature of the whole distribution, ie. skewness, as opposed to a few unusual observations with the rest of the distribution having a consistent shape.)

  5. Explain briefly why, nonetheless, using a \(t\)-test in this situation may be reasonable.

There are two considerations about whether to use a \(t\)-test. One is the shape of the data distribution, which as we have seen is very right-skewed. But the other is the sample size, which is here 200, very large. With such a large sample size, we can expect a lot of help from the Central Limit Theorem; it almost doesn’t matter what shape the data distribution is, and the \(t\)-test will still be good.

We don’t yet have a good intuition about how much that large sample size helps (it may turn out that the sample size is not large enough after all), which is why I phrased the question as I did: the way to think about it is “what piece of theory do I have that says that having a sample size like the one I do here could lead to a \(t\)-test being reasonable?”.

Extra: you might be finding it hard to believe that a sample size of 200 is enough to overcome the considerable right-skewness that you see in the plot. Later, we look at a technique for approximating the sampling distribution of the sample mean called the “bootstrap”. Here that looks like this (the code will make more sense later):

tibble(1:1000) %>% 
  rowwise() %>% 
  mutate(my_sample = 
           list(sample(calls$call_length, replace = TRUE))) %>% 
  mutate(my_mean = mean(my_sample)) -> b
ggplot(b, aes(x = my_mean)) + geom_histogram(bins = 10)

That, despite the skewness in the distribution of the data, is pretty close to normal, and indicates that the sample size is large enough for the \(t\)-test to be valid. (Said differently, the sample size is large enough to overcome that (considerable) skewness.)

  6. Test whether the mean length of all seniors’ calls in this city could be the same as the overall mean length of all calls made on the company’s network in that city, or whether it is different. What do you conclude, in the context of the data?

Test whether the population mean is 9.2 minutes (null hypothesis) against whether it is different from 9.2 minutes (two-sided alternative hypothesis). Two-sided is the default, so no alternative is needed in the code:

with(calls, t.test(call_length, mu = 9.2))

    One Sample t-test

data:  call_length
t = -1.3003, df = 199, p-value = 0.195
alternative hypothesis: true mean is not equal to 9.2
95 percent confidence interval:
 6.985424 9.654576
sample estimates:
mean of x 
     8.32 

The P-value of 0.195 is not small (it is not less than 0.05), so we have no reason to reject the null hypothesis. There is no evidence that the mean length of seniors’ phone calls differs from the overall mean.

When you’re doing a hypothesis-testing question, the two most important things are the P-value and the conclusion in the context of the data. You should also make it clear that you know what the null and alternative hypotheses are, either by explicitly stating them, or by saying something like “I fail to reject the null hypothesis” and following it with a statement in the context of the data that is consistent with failing to reject the correct null hypothesis. The reasons are:

  • if you give the P-value, you allow your reader to make their own decision if they disagree with your choice of \(\alpha\), or, more generally, you convey the strength of the evidence against the null hypothesis (very weak in this case). In a learning context, giving the right P-value (especially if there is more than one) shows that you understand what is going on.
  • the reason for doing a hypothesis test is to make a decision about some specific population, and that will usually (in the real world) lead to some action that needs to be taken. The person for whom you are doing the test needs to know exactly what decision or action you are recommending. (Thus, in a course like this, saying “do not reject the null hypothesis” and stopping there is worth only half the points, if you are lucky.)

One important point to note is that just because the sample mean of 8.32 is a lot less than the hypothesized mean of 9.2, it does not mean that seniors must be different than the general population. Our test is saying that a sample of 200 seniors could easily have produced a sample mean of 8.32 even if the population mean were 9.2. Why? Because there is a lot of variability. Phone calls vary a lot in length, from a ten-second call to check whether a store is open, to catching up with a friend you haven’t seen for a while, which could go over 30 minutes. The test is two-sided, so it makes sense to look at the confidence interval, which tells the same story: with 95% confidence, the mean call length could be anywhere between 7.0 and 9.7 minutes.8
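If you want just the interval rather than reading it off the full output, you can save the result of t.test and pull out the piece you need (the name t_result is just something I made up):

t_result <- with(calls, t.test(call_length, mu = 9.2))
t_result$conf.int  # 6.985424 9.654576, as in the output above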

Extra 1: what the cellphone company really cares about is total usage of its system, and therefore how much it should charge its customers to use it. The mean length of a call, rather than, say, the median length, is what it wants to know about. (Along with the total number of calls made by each customer, which the phone company will have a record of.) Later, we will see the sign test, which makes no assumptions about the data distribution, but which tests the median call length rather than the mean. So for this situation, the sign test would not work if you were unhappy about doing a \(t\)-test.

Extra 2: These are actually fake data. Sorry to disappoint you. But the survey is real, and the summaries are more or less real. I had to make up some data to fit the summaries, so that you would have an actual test to do in R.

The first thought is to generate random normal data. Except that we already said that a normal distribution would produce values less than zero, which are impossible for call lengths. It also seemed that call lengths would be recorded as whole numbers, and thus a discrete distribution is needed. The only discrete distribution you can probably think of is the binomial.

The basic idea of the binomial does work. That is, imagine trials where you can either succeed or fail, with a probability \(p\) of succeeding every time. The binomial says that you have a fixed number of trials, and you count the number of successes. The problem here is that the binomial variance (the standard deviation squared) is less than the binomial mean. The algebra is not hard: the mean is \(np\), where \(n\) is the number of trials, and the variance is \(np(1-p)\). The variance differs from the mean by a factor of \(1-p\), and \(p\) is a probability, so this factor is less than 1.
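To see that concretely, using R as a calculator with some arbitrary values I picked (20 trials, success probability 0.4):

n_trials <- 20
p <- 0.4
n_trials * p            # binomial mean: 8
n_trials * p * (1 - p)  # binomial variance: 4.8, smaller than the mean by a factor of 1 - p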

In our case, though, the mean is 8.3 and even the standard deviation, 9.6, is bigger than the mean, so the variance must be way bigger.

So what distribution has variance bigger than the mean? One way to think about this is to flip around the binomial: instead of having a fixed number of trials and counting successes, have a fixed number of successes and count the number of trials you get before achieving them. Equivalently, you can count how many failures happen before you get the desired number of successes. You might imagine that it could take you a very long time to get to the required number of successes. Imagine you are playing Monopoly and you are in jail; it might take you a long time to roll doubles (probability \(1/6\) each time) to get out of jail, were it not for the rule that if you fail three times to roll doubles, you have to pay a $50 fine to get out of jail.

Counting the number of trials (or, equivalently, the number of failures) gives you the so-called negative binomial distribution. This does have a variance bigger than its mean, so it will work here.9 I was actually aiming for a mean of 8 and an SD of 10 (and thus a variance of 100), these being what the original survey produced.

You can write the mean and SD of the negative binomial in terms of the success probability per trial and the number of successes needed before you can stop (called \(n\) here). But you can also write the variance in terms of the mean: if the mean is \(\mu\), the variance is \(\mu + \mu^2/n\).10

The negative binomial, if you count failures until the \(n\)th success, starts at zero (you could succeed \(n\) times right off the top), but we want our phone call lengths to start at 1 (the shortest possible phone call is, rounding up, 1 minute long). So I use a mean of 7 for my negative binomial, and add 1 back on at the end to get a mean of 8 and a minimum value of 1.

Thus we want \(\mu = 7\). We also want a variance of \(10^2=100\), so we need to figure out what \(n\) has to be to make that happen. You can do this by trial and error,11 or algebra. The algebra way says to put \(\sigma^2 = \mu + \mu^2 / n\) and to solve for \(n\), giving \(n = \mu^2/(\sigma^2-\mu)\):

n <- 7^2/(100-7)
n
[1] 0.5268817

You might be wondering how we can count the failures before seeing half a success, but mathematically it works all right, so we let that go.12

Next, to generate some random data with that negative binomial distribution. R has functions whose names begin with r to generate random data; the one we need here is called rnbinom. Its inputs are the number of values to generate, the mean mu and the number of successes you need to get, called size:

calls <- tibble(call_length = rnbinom(200, mu = 7, size = n) + 1 )
calls

and that’s where your data came from. (The actual values here are different, because randomness, and their mean and SD are not quite what I was aiming for, also because randomness.)

Footnotes

  1. R has no way to know what a logical order is for these data, unless we tell it.↩︎

  2. You could have another column called Computer in here as well, and R would treat them as different. But that means that in R, case matters.↩︎

  3. Though height_in_cm, if we had such a column, would match contains.↩︎

  4. You might want not to do that if being missing is itself informative about heights. The sort of thing I mean is if students with short parents are more likely not to report their parents’ heights. In that case, a mean of parental heights would be biased because you have too few small ones.↩︎

  5. rm is shorthand for “remove”, and NA is R’s notation for “missing”. Thus the function option na.rm = TRUE means “it is true that you remove the missing values” before calculating the mean. rm comes from the Unix operating system, which lives on as Linux, and is a command for removing files.↩︎

  6. It doesn’t really matter how much you round off the SD because this calculation is only to give us a rough idea.↩︎

  7. There is a distinction between points that are plotted separately on a boxplot, which are sometimes called “suspected outliers”, and actual outliers. Points that are plotted separately on a boxplot might be outliers, if there are a few of them a long way from the rest of the distribution, or they might be part of a long tail, if there are a lot of them not especially far from the whisker. This distribution falls into the second category. When Tukey was popularizing the boxplot, he imagined that it would be used with smaller sample sizes than this; his definition of “suspected outliers”, that business of “1.5 times IQR beyond the quartiles”, tends to produce a lot of them when the sample size is large.↩︎

  8. If you wanted to estimate this more accurately, you would need an even bigger sample.↩︎

  9. If you have run into the Poisson distribution, you see that it occupies a middle ground between the binomial (variance less than mean) and negative binomial (variance greater than mean) because the Poisson variance is equal to the mean.↩︎

  10. Thus the variance must be bigger than the mean, because \(\mu^2/n\) is positive no matter what the mean is.↩︎

  11. This is the kind of thing a spreadsheet is good for. Put a trial value of \(n\) in a cell, whatever you like, and in another cell have a formula that calculates the variance in terms of the mean and your cell with \(n\) in it. Adjust the value in the \(n\) cell until the variance comes out right. The formula for the variance has \(n\) on the bottom, so to make the variance bigger you have to make \(n\) smaller, and vice versa. If you have Goal Seek and know how to use it, that is also good here.↩︎

  12. In calculations with the negative binomial distribution, R uses gamma functions rather than factorials, so the n parameter doesn’t have to be an integer.↩︎