Worksheet 4

Published

February 2, 2024

Questions are below. My solutions are below all the question parts for a question; scroll down if you get stuck. There is extra discussion below that for some of the questions; you might find that interesting to read, maybe after tutorial.

For these worksheets, you will learn the most by spending a few minutes thinking about how you would answer each question before you look at my solution. There are no grades attached to these worksheets, so feel free to guess: it makes no difference at all how wrong your initial guess is!

If you don’t get to the end in tutorial, it’s a good idea to finish them on your own time this week.

1 Seniors and cellphones

A cellphone company is thinking about offering a discount to new senior customers (aged 65 and over), but first wants to know whether seniors differ in their usage of cellphone services. The company knows that, for all its current customers in a certain city, the mean length of a voice call is 9.2 minutes, and wants to know whether its current senior customers have the same or a different average length. In a recent survey, the cellphone company contacted a large number of its current customers, and asked for (among other things) the customer’s age group and when they made their last call. The length of that call was determined from the company’s records. There were 200 seniors in the survey.

The data are in http://ritsokiguess.site/datafiles/senior_phone.csv. These are only the seniors.

  1. Read in and display (some of) the data.

  2. Find the mean and standard deviation of the call lengths.

  3. Why might you doubt, even without looking at a graph, that the call lengths will resemble a normal distribution in shape? Explain briefly. You might find it helpful to use the fact that pnorm(z) works out how much of a standard normal distribution is less than the value z.

  4. Draw an appropriate graph of these data. Were your suspicions about shape confirmed?

  5. Explain briefly why, nonetheless, using a \(t\)-test in this situation may be reasonable.

  6. Test whether the mean length of all seniors’ calls in this city could be the same as the overall mean length of all calls made on the company’s network in that city, or whether it is different. What do you conclude, in the context of the data?

My solutions

(a) Read in and display (some of) the data.

my_url <- "http://ritsokiguess.site/datafiles/senior_phone.csv"
calls <- read_csv(my_url)
Rows: 200 Columns: 1
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
dbl (1): call_length

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
calls

One column, called call_length, containing the call lengths, evidently in whole numbers of minutes, and 200 rows, one for each senior customer.

(b) Find the mean and standard deviation of the call lengths.

This is summarize, without even the need for a group_by:

calls %>% summarize(mean_length = mean(call_length),
                    sd_length = sd(call_length))

The mean is 8.3 minutes, and the standard deviation is 9.6 minutes. (The data are whole numbers, so give your answers with one or at most two decimals. The reader won’t be able to use these numbers as is: too many decimals. So you should give the answers and round them for your reader, rather than making the reader do extra work.)

It is good practice to give the summaries meaningful names, particularly if there are several summaries, or summaries of several different variables. It matters a little less here, since there is only one variable, but it’s still a good idea.
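
If you want R to do the rounding for you, one way (just a sketch, and entirely optional) is to wrap each summary in round, where the second input is the number of decimal places to keep:

calls %>% summarize(mean_length = round(mean(call_length), 1),
                    sd_length = round(sd(call_length), 1))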

(c) Why might you doubt, even without looking at a graph, that the call lengths will resemble a normal distribution in shape? Explain briefly. You might find it helpful to use the fact that pnorm(z) works out how much of a standard normal distribution is less than the value z.

The first thing to consider is that these are lengths of time, so that they cannot be less than zero. With that in mind, the standard deviation looks awfully big.

Distributions with a limit (here 0) are often skewed away from that limit (here, to the right). This is particularly likely if there are values near the limit — here there are, because if you scan through the data, you’ll see a lot of very short calls. The large standard deviation comes from a long tail the other way: the smallish number of very long calls.

In summary, make two points:

  • there is a lower limit of 0 for call length
  • the reason for the SD being large is a long tail the other way: that is, some of the call lengths must be very large.

Another way to go is to think about what a normal distribution with this mean and SD would look like, and make the case that this makes no sense for these data. This is where you can use my hint about pnorm.

For example, if you think of the 68-95-99.7 rule for normal distributions, about 95% of the distribution should be between these two values (using R as a calculator):1

8.32 - 2 * 9.57
[1] -10.82
8.32 + 2 * 9.57
[1] 27.46

But this includes a lot of values below zero, which are impossible for lengths of phone calls.

Or, taking a similar angle, work out how much of a normal distribution with this mean and SD would be less than zero. My technique here is to once again use R as a calculator, but to do it as you would by hand (STAB22/STAB52 style):

z <- (0 - 8.32) / 9.57
z
[1] -0.8693835
pnorm(z)
[1] 0.1923187

Almost 20% of this distribution is below zero, which, as we said, makes no sense for lengths of phone calls. So this is another way to say that normality makes no sense here.

(d) Draw an appropriate graph of these data. Were your suspicions about shape confirmed?

One quantitative variable, so a histogram:

ggplot(calls, aes(x = call_length)) + geom_histogram(bins = 8)

Absolutely. That is certainly skewed to the right.

A one-sample boxplot is also possible:

ggplot(calls, aes(x = 1, y = call_length)) + geom_boxplot()

That is a very right-skewed distribution. (I like that better as a description compared with “there are a lot of outliers”, because the large number of unusual values2 added to a long upper whisker points to this being a feature of the whole distribution, i.e., skewness, as opposed to a few unusual observations with the rest of the distribution having a consistent shape.)

(e) Explain briefly why, nonetheless, using a \(t\)-test in this situation may be reasonable.

There are two considerations about whether to use a \(t\)-test. One is the shape of the data distribution, which as we have seen is very right-skewed. But the other is the sample size, which is here 200, very large. With such a large sample size, we can expect a lot of help from the Central Limit Theorem; it almost doesn’t matter what shape the data distribution is, and the \(t\)-test will still be good.

We don’t yet have a good intuition about how much that large sample size helps (it may turn out that the sample size is not large enough after all), which is why I phrased the question as I did: the way to think about it is “what piece of theory do I have that says that having a sample size like the one I do here could lead to a \(t\)-test being reasonable?”.

(f) Test whether the mean length of all seniors’ calls in this city could be the same as the overall mean length of all calls made on the company’s network in that city, or whether it is different. What do you conclude, in the context of the data?

Test whether the population mean is 9.2 minutes (null hypothesis) against whether it is different from 9.2 minutes (two-sided alternative hypothesis). Two-sided is the default, so no alternative is needed in the code:

with(calls, t.test(call_length, mu = 9.2))

    One Sample t-test

data:  call_length
t = -1.3003, df = 199, p-value = 0.195
alternative hypothesis: true mean is not equal to 9.2
95 percent confidence interval:
 6.985424 9.654576
sample estimates:
mean of x 
     8.32 

The P-value of 0.195 is not small (it is not less than 0.05), so we have no reason to reject the null hypothesis. There is no evidence that the mean length of seniors’ phone calls differs from the overall mean.

When you’re doing a hypothesis-testing question, the two most important things are the P-value and the conclusion in the context of the data. You should also make it clear that you know what the null and alternative hypotheses are, either by explicitly stating them, or by saying something like “I fail to reject the null hypothesis” and following it with a statement in the context of the data that is consistent with failing to reject the correct null hypothesis. The reasons are:

  • if you give the P-value, you allow your reader to make their own decision if they disagree with your choice of \(\alpha\), or, more generally, you convey the strength of the evidence against the null hypothesis (very weak in this case). In a learning context, giving the right P-value (especially if there is more than one) shows that you understand what is going on.
  • the reason for doing a hypothesis test is to make a decision about some specific population, and that will usually (in the real world) lead to some action that needs to be taken. The person for whom you are doing the test needs to know exactly what decision or action you are recommending. (Thus, in a course like this, saying “do not reject the null hypothesis” and stopping there is worth only half the points, if you are lucky.)

One important point to note is that just because the sample mean of 8.32 is a lot less than the hypothesized mean of 9.2, it does not mean that seniors must be different than the general population. Our test is saying that a sample of 200 seniors could easily have produced a sample mean of 8.32 even if the population mean were 9.2. Why? Because there is a lot of variability. Phone calls vary a lot in length, from a ten-second call to check whether a store is open, to catching up with a friend you haven’t seen for a while, which could go over 30 minutes. The test is two-sided, so it makes sense to look at the confidence interval, which tells the same story: with 95% confidence, the mean call length could be anywhere between 7.0 and 9.7 minutes.3
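
A side note: if you want to use the P-value or the confidence interval later, rather than reading them off the printed output, you can save the result of t.test and pull out the pieces you need. A sketch (the name t_senior is mine, not anything standard):

t_senior <- with(calls, t.test(call_length, mu = 9.2))
t_senior$p.value
t_senior$conf.int

These are the same numbers that appear in the printed output above.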

2 Home prices

A realtor kept track of the asking prices of 37 homes for sale in West Lafayette, Indiana, in a particular year. The asking prices are in http://ritsokiguess.site/datafiles/homes.csv. There are two columns, the asking price (in $) and the number of bedrooms that home has (either 3 or 4, in this dataset). The realtor was interested in whether the mean asking price for 4-bedroom homes was bigger than for 3-bedroom homes.

  1. Read in and display (some of) the data.

  2. Draw a suitable graph of these data.

  3. Comment briefly on your plot. Does it suggest an answer to the realtor’s question? Do you have any doubts about the appropriateness of a \(t\)-test in this situation? Explain briefly.

  4. Sometimes prices work better on a log scale. This is because percent changes in prices are often of more interest than absolute dollar-value changes. Re-draw your plot using logs of asking prices. (In R, log() takes natural (base \(e\)) logs, which are fine here.) Do you like the shapes of the distributions better? Hint: you have a couple of options. One is to use the log right in your plotting (or, later, testing) functions. Another is to define a new column containing the log-prices and work with that.

  5. Run a suitable \(t\)-test to compare the log-prices. What do you conclude?

My solutions

(a) Read in and display (some of) the data.

The exact usual:

my_url <- "http://ritsokiguess.site/datafiles/homes.csv"
asking <- read_csv(my_url)
Rows: 37 Columns: 2
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
dbl (2): price, bdrms

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
asking

There are indeed 37 homes; the first few of them have 4 bedrooms, and the ones further down, if you scroll, have 3. The price column does indeed look like asking prices of homes for sale.4

(b) Draw a suitable graph of these data.

Two groups of prices to compare, or one quantitative column and one column that appears to be categorical (it’s actually a number, but it’s playing the role of a categorical or grouping variable). So a boxplot. This requires care, though; if you do it without thinking you’ll get this:

ggplot(asking, aes(x = bdrms, y = price)) + geom_boxplot()
Warning: Continuous x aesthetic
ℹ did you forget `aes(group = ...)`?

This makes no sense — there are supposed to be two groups of houses, and this plot has only one. What happened? The warning is the clue: the number of bedrooms looks quantitative (“continuous”) and so ggplot has tried (and failed) to treat it as such.

The perhaps most direct way around this is to take the warning message at face value and add bdrms as a group, thus:

ggplot(asking, aes(x = bdrms, y = price, group = bdrms)) + geom_boxplot()

and that works (as you will see below, it is effectively the same as the other methods that require a bit more thought).

You might be thinking that this is something like black magic, so I offer another idea where you have a fighting chance of understanding what is being done.

The problem is that bdrms looks like a quantitative variable (it has values 3 and 4 that are numbers), but we want it to be treated as a categorical variable. The easiest way to turn it into one is via factor, like this:

ggplot(asking, aes(x = factor(bdrms), y = price)) + geom_boxplot()

If the funny label on the \(x\)-axis bothers you, and it probably should,5 define a new variable first that is the factor version of bdrms. You can overwrite the old bdrms since we will not need the number as a number anywhere in this question:6

asking %>% 
  mutate(bdrms = factor(bdrms)) -> asking
ggplot(asking, aes(x = bdrms, y = price)) + geom_boxplot()

and that works smoothly.7

As a very quick extra: factor(bdrms) and group = bdrms both correctly give two boxplots side by side, but if you look carefully, the shaded grey area in the background of the graph is slightly different in each case. The group = way still treats bdrms as quantitative, and the \(x\)-axis reflects that (there is an axis “tick” at 3.5 bedrooms), but the factor(bdrms) plot treats the made-categorical bdrms as a genuine categorical variable with the values 3 and 4 and nothing else (the \(x\)-axis only has ticks at 3 and 4). From that point of view, the group = bdrms plot is a bit of a hack: it makes the boxplots come out right without fixing up the \(x\)-axis.
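
If you like the group = bdrms way but that tick at 3.5 bedrooms bothers you, one possible fix is to set the axis breaks yourself. This is only a sketch, and it assumes bdrms is still a number, so it won’t run as written once you have overwritten bdrms with its factor version:

# assumes bdrms is still numeric (that is, before the factor() overwrite)
ggplot(asking, aes(x = bdrms, y = price, group = bdrms)) +
  geom_boxplot() +
  scale_x_continuous(breaks = c(3, 4))

This keeps bdrms quantitative but only puts axis ticks at 3 and 4.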

(c) Comment briefly on your plot. Does it suggest an answer to the realtor’s question? Do you have any doubts about the appropriateness of a \(t\)-test in this situation? Explain briefly.

It seems pretty clear that the average (on this plot, median) asking price for 4-bedroom houses is higher than for 3-bedroom houses. However, for a \(t\)-test to be appropriate, we need approximately normal distributions within each group of asking prices, and you can reasonably say that we do not: both distributions of asking prices are skewed to the right, and the 3-bedroom asking prices have three outliers at the top end.8

The other thing you need to consider is sample size: there are 37 houses altogether, so about 20 in each group:

asking %>% count(bdrms)

Thus the Central Limit Theorem will offer some help, but you could reasonably argue that even a sample size of 23 won’t be enough to fix up that skewness and those outliers in the 3-bedroom group.

(d) Sometimes prices work better on a log scale. This is because percent changes in prices are often of more interest than absolute dollar-value changes. Re-draw your plot using logs of asking prices. (In R, log() takes natural (base \(e\)) logs, which are fine here.) Do you like the shapes of the distributions better? Hint: you have a couple of options. One is to use the log right in your plotting (or, later, testing) functions. Another is to define a new column containing the log-prices and work with that.

You can put the log right in the ggplot command, thus:

ggplot(asking, aes(x = bdrms, y = log(price))) + geom_boxplot()

These look a lot better. The 4-bedroom distribution is close to symmetric and the 3-bedroom distribution is much less skewed (and has lost its outliers).

For this, and the sample sizes we have, I would now have no problem at all with a \(t\)-test.

The other way to do this is to make a new column that has the log-price in it:

asking %>% 
  mutate(log_price = log(price)) -> asking

and then make the plot:

ggplot(asking, aes(x = bdrms, y = log_price)) + geom_boxplot()

Both ways come out the same, and are equally good.

For the second way, it is better to save a dataframe with the log-prices in it and then make a plot, because we will be using the log-prices in our hypothesis test in a moment. If you use a pipeline here, like this:

asking %>% 
  mutate(log_price = log(price)) %>% 
  ggplot(aes(x = bdrms, y = log_price)) + geom_boxplot()

it works here, but you will have to define the log-prices again below. If you don’t see that now, that’s OK, but when you come to do the \(t\)-test with the log-prices in the next part, you ought to realize that you are doing something inefficient by calculating the log-prices again, so you should come back here and save the dataframe with the log-prices in it so that you don’t have to calculate them again. Or, I guess, use the log-prices directly in the \(t\)-test, but it seems odd to do one thing one way and the other thing a different way.

(e) Run a suitable \(t\)-test to compare the log-prices. What do you conclude?

Bear in mind what the realtor wants to know: whether the mean (log-) price is higher for 4-bedroom houses vs. 3-bedroom houses. This was something the realtor was curious about before they even looked at the data, so a one-sided test is appropriate. The groups go into the test in order, so the difference is the 3-bedroom mean minus the 4-bedroom mean; we expect that difference to be negative, so the alternative will be "less". Once again, you can put the log directly into the t.test, or use a column of log-prices that you create (such as the one you did for the boxplot, if you did that). Thus, two possibilities are:

t.test(log(price)~bdrms, data = asking, alternative = "less")

    Welch Two Sample t-test

data:  log(price) by bdrms
t = -5.1887, df = 30.59, p-value = 6.481e-06
alternative hypothesis: true difference in means between group 3 and group 4 is less than 0
95 percent confidence interval:
       -Inf -0.4139356
sample estimates:
mean in group 3 mean in group 4 
       11.82912        12.44410 

and (my new column was called log_price):

t.test(log_price ~ bdrms, data = asking, alternative = "less")

    Welch Two Sample t-test

data:  log_price by bdrms
t = -5.1887, df = 30.59, p-value = 6.481e-06
alternative hypothesis: true difference in means between group 3 and group 4 is less than 0
95 percent confidence interval:
       -Inf -0.4139356
sample estimates:
mean in group 3 mean in group 4 
       11.82912        12.44410 

The P-value and conclusion are the same either way. The P-value is 0.0000065, way less than 0.05, so there is no doubt that the mean (log-) asking price is higher for 4-bedroom homes than it is for 3-bedroom homes.

Side note: t.test is more forgiving than ggplot was with bdrms. Before, we had to wrap it in factor to get it treated as a categorical variable. It is reasonable enough to do that here as well (it works either way), and using factor(bdrms) shows that you are suspecting that there might be a problem again, which is intelligent. t.test, however, like other things from the early days of R,9 is more forgiving: it uses the distinct values of the variable on the right of the squiggle (bdrms) to make groups, whether they are text or numbers. Since the two-sample \(t\)-test is for comparing exactly two groups, it will complain if bdrms has more than two distinct values, but here we are good.
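
For example, this sketch (whose output I have not shown) should give exactly the same results as the tests above:

t.test(log_price ~ factor(bdrms), data = asking, alternative = "less")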

The other thing you should consider is whether we should have done a Welch or a pooled test. This one is, as you see, Welch, but a pooled test would be better if the two groups of log-prices had equal spreads. Go back and look at the last boxplot you did: on the log scale, the two spreads do actually look pretty similar.10 So we could also have done the pooled test. My guess (I haven’t looked at the results yet as I type this) is that the results will be almost identical in fact:

t.test(log_price ~ bdrms, data = asking, alternative = "less", var.equal = TRUE)

    Two Sample t-test

data:  log_price by bdrms
t = -5.0138, df = 35, p-value = 7.693e-06
alternative hypothesis: true difference in means between group 3 and group 4 is less than 0
95 percent confidence interval:
       -Inf -0.4077387
sample estimates:
mean in group 3 mean in group 4 
       11.82912        12.44410 

The test statistic and P-value are very close, and the conclusion is identical, so it didn’t matter which test you used. But the best answer will at least consider whether a pooled or a Welch test is the better one to use.
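
If you would like a numerical check of the “similar spreads” claim to go along with the boxplot, a quick sketch (using the log_price column defined earlier) is to find the standard deviation of the log-prices in each group:

asking %>% 
  group_by(bdrms) %>% 
  summarize(sd_log = sd(log_price))

If the two SDs come out close to each other, that supports using the pooled test.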

3 Extras

For the senior phone question:

Extra 1: if you were employed by the cellphone company, you would probably receive a spreadsheet containing lots of columns, maybe even the whole survey, with each customer’s phone number (or account number), and a lot of other columns containing responses to other questions on the survey. One of the columns would be age group, and then you would have to extract only the rows that were seniors (using filter) before you could get to work.
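
As a minimal sketch of what that might look like (the dataframe name survey and the column name age_group are made up for illustration; they are not in the data you were given):

# hypothetical names, for illustration only
survey %>% 
  filter(age_group == "65 and over") %>% 
  select(call_length)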

Extra 2: back in the days of phone plans that weren’t “unlimited”, you got charged by the minute, and phone companies used to round the number of minutes up, so that you would see what you see here: a call to a store to see whether it is open, that lasted about 10 seconds, would be counted as one minute for your plan.

Extra 3: a code note — pnorm does the same thing that those normal tables at the back of the textbook do (the textbook for your first stats course): it turns a z into a “probability of less than”. In fact, pnorm will work out probability of less than for any normal distribution; you don’t have to standardize it first. To use this, add the mean and SD as second and third inputs, thus:

pnorm(0, 8.32, 9.57)
[1] 0.1923187

to get the same answer. The reason for the standardizing business in your first course is that making normal tables for every possible normal distribution would use an awful lot of paper, but it’s not too hard to turn a value from any normal distribution into one from a standard normal distribution, and so having a table of the standard normal is enough. With a computer handy, though, R can just calculate the value you need anytime.

Extra 4: you might be finding it hard to believe that a sample size of 200 is enough to overcome the considerable right-skewness that you see in the plot. Later, we look at a technique for approximating the sampling distribution of the sample mean called the “bootstrap”. Here that looks like this (the code will make more sense later):

tibble(1:1000) %>% 
  rowwise() %>% 
  mutate(my_sample = 
           list(sample(calls$call_length, replace = TRUE))) %>% 
  mutate(my_mean = mean(my_sample)) -> b
ggplot(b, aes(x = my_mean)) + geom_histogram(bins = 10)

That, despite the skewness in the distribution of the data, is pretty close to normal, and indicates that the sample size is large enough for the \(t\)-test to be valid. (Said differently, the sample size is large enough to overcome that (considerable) skewness.)

Extra 5: what the cellphone company really cares about is total usage of its system, and therefore how much it should charge its customers to use it. The mean length of a call, rather than, say, the median length, is what it wants to know about. (Along with the total number of calls made by each customer, which the phone company will have a record of.) Later, we will see the sign test, which makes no assumptions about the data distribution, but which tests the median call length rather than the mean. So for this situation, the sign test would not help if you were unhappy about doing a \(t\)-test.

Extra 6: These are actually fake data. Sorry to disappoint you. But the survey is real, and the summaries are more or less real. I had to make up some data to fit the summaries, so that you would have an actual test to do in R.

The first thought is to generate random normal data. Except that we already said that a normal distribution would produce values less than zero, which are impossible for call lengths. It also seemed that call lengths would be recorded as whole numbers, and thus a discrete distribution is needed. The only discrete distribution you can probably think of is the binomial.

The basic idea of the binomial does work. That is, imagine trials where you can either succeed or fail, with a probability \(p\) of succeeding every time. The binomial says that you have a fixed number of trials, and you count the number of successes. The problem here is that the binomial variance (the standard deviation squared) is less than the binomial mean. The algebra is not hard: the mean is \(np\), where \(n\) is the number of trials, and the variance is \(np(1-p)\). The variance differs from the mean by a factor of \(1-p\), and since \(p\) is a probability, this factor is less than 1, so the variance is the smaller of the two.

In our case, though, the mean is 8.3 and even the standard deviation, 9.6, is bigger than the mean, so the variance must be way bigger.

So what distribution has variance bigger than the mean? One way to think about this is to flip around the binomial: instead of having a fixed number of trials and counting successes, have a fixed number of successes and count the number of trials it takes to achieve them. Equivalently, you can count how many failures happen before you get the desired number of successes. You might imagine that it could take you a very long time to get to the required number of successes. Imagine you are playing Monopoly and you are in jail; it might take you a long time to roll doubles (probability \(1/6\) each time) to get out of jail, were it not for the rule that if you fail three times to roll doubles, you have to pay a $50 fine to get out of jail.

Counting the number of trials (or, equivalently, the number of failures) gives you the so-called negative binomial distribution. This does have a variance bigger than its mean, so it will work here.11 I was actually aiming for a mean of 8 and an SD of 10 (and thus a variance of 100), these being what the original survey produced.

You can write the mean and SD of the negative binomial in terms of the success probability per trial and the number of successes needed before you can stop (called \(n\) here). But you can also express the variance directly in terms of the mean: if the mean is \(\mu\), the variance is \(\mu + \mu^2/n\).12

The negative binomial, if you count failures until the \(n\)th success, starts at zero (you could succeed \(n\) times right off the top), but we want our phone call lengths to start at 1 (the shortest possible phone call is, rounding up, 1 minute long). So I use a mean of 7 for my negative binomial, and add 1 back on at the end to get a mean of 8 and a minimum value of 1.

Thus we want \(\mu = 7\). We also want a variance of \(10^2=100\), so we need to figure out what \(n\) has to be to make that happen. You can do this by trial and error,13 or algebra. The algebra way says to put \(\sigma^2 = \mu + \mu^2 / n\) and to solve for \(n\), giving \(n = \mu^2/(\sigma^2-\mu)\):

n <- 7^2/(100-7)
n
[1] 0.5268817

You might be wondering how we can count the failures before seeing half a success, but mathematically it works all right, so we let that go.14
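
As a quick check (a sketch) that this value of n does what we want, plug it back into the variance formula; the answer should come out to 100, the variance we were aiming for:

# variance of a negative binomial with mean 7 and this n
7 + 7^2 / n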

Next, to generate some random data with that negative binomial distribution. R has functions whose names begin with r to generate random data; the one we need here is called rnbinom. Its inputs are the number of values to generate, the mean mu and the number of successes you need to get, called size:

calls <- tibble(call_length = rnbinom(200, mu = 7, size = n) + 1 )
calls

and that’s where your data came from. (The actual values here are different, because randomness, and their mean and SD are not quite what I was aiming for, also because randomness.)

For the homes question:

Extra: as I originally conceived this question, I was going to have you finish by finding a confidence interval to quantify how different the mean (log-) prices are. The problem with that here is that you get, if you re-do it two-sided, a confidence interval for the difference in mean log-prices, not an easy thing to interpret:

t.test(log_price ~ bdrms, data = asking)

    Welch Two Sample t-test

data:  log_price by bdrms
t = -5.1887, df = 30.59, p-value = 1.296e-05
alternative hypothesis: true difference in means between group 3 and group 4 is not equal to 0
95 percent confidence interval:
 -0.8568304 -0.3731165
sample estimates:
mean in group 3 mean in group 4 
       11.82912        12.44410 

Some thinking required: this is a difference of means of logs of prices. How can we say something about actual prices here? Let’s ignore the mean part for now; the scale these things are on is log-prices. What do we know about differences of logs? Haul out some math here:15

\[ \log a - \log b = \log(a/b), \] so

\[ \exp (\log a - \log b) = a/b.\]

So how does this apply to our confidence interval? What it says is that if you take the confidence interval for the difference in means of log-prices, and exp its endpoints, what you get is a confidence interval for the ratio of means of the actual prices:

ci_log <- c(-0.8568304,-0.3731165)
exp(ci_log)
[1] 0.4245055 0.6885850

This says that the average asking price for the 3-bedroom houses is between 42 and 69 percent of the average asking price for the 4-bedroom houses. Thus the asking prices for the 3-bedroom houses are quite a bit less on average.16 So it is not at all surprising that the P-value was so small, whether you did pooled or Welch.
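
A side note: rather than copying the interval endpoints by hand as I did above, you could save the test result and exp its confidence interval directly. A sketch (tt is just a name I made up):

tt <- t.test(log_price ~ bdrms, data = asking)
exp(tt$conf.int)

This avoids any typing (or rounding) slips in the endpoints.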

Footnotes

  1. It doesn’t really matter how much you round off the SD because this calculation is only to give us a rough idea.↩︎

  2. There is a distinction between points that are plotted separately on a boxplot, which are sometimes called “suspected outliers”, and actual outliers. Points that are plotted separately on a boxplot might be outliers, if there are a few of them a long way from the rest of the distribution, or they might be part of a long tail, if there are a lot of them not especially far from the whisker. This distribution falls into the second category. When Tukey was popularizing the boxplot, he imagined that it would be used with smaller sample sizes than this; his definition of “suspected outliers”, that business of “1.5 times IQR beyond the quartiles”, tends to produce a lot of them when the sample size is large.↩︎

  3. If you wanted to estimate this more accurately, you would need an even bigger sample.↩︎

  4. At least, for somewhere that is not Toronto!↩︎

  5. Note that the group idea I showed you first gives you a perfectly reasonable axis label.↩︎

  6. I am not asking for anything like the mean number of bedrooms anywhere here.↩︎

  7. Turning the number of bedrooms into text, via as.character(bdrms), also works. This is actually how we have been handling categorical variables so far: reading them in as text. Usually, though, they look like text. Here they don’t.↩︎

  8. I would be happy to call these genuine outliers, because there are only three of them, and they look a little separated from the whisker, so that it is reasonable to say that these three asking prices are bigger than the others.↩︎

  9. You might see cbind from base R in other courses; I use it because it is more forgiving than the tidyverse ways of gluing things together.↩︎

  10. The spread of actual asking prices without taking logs is bigger for the 4-bedroom houses, so if we had been willing to do a \(t\)-test without taking logs first, we should definitely have preferred the Welch test. The effect of taking logs is to bring the higher values down, compared to the lower ones, which made both distributions less right-skewed and also made the spreads more equal. For this reason, the log transformation is often useful: it can both equalize spread and make things look more normal, all at once.↩︎

  11. If you have run into the Poisson distribution, you see that it occupies a middle ground between the binomial (variance less than mean) and negative binomial (variance greater than mean) because the Poisson variance is equal to the mean.↩︎

  12. Thus the variance must be bigger than the mean, because \(\mu^2/n\) is positive no matter what the mean is.↩︎

  13. This is the kind of thing a spreadsheet is good for. Put a trial value of \(n\) in a cell, whatever you like, and in another cell have a formula that calculates the variance in terms of the mean and your cell with \(n\) in it. Adjust the value in the \(n\) cell until the variance comes out right. The formula for the variance has \(n\) on the bottom, so to make the variance bigger you have to make \(n\) smaller, and vice versa. If you have Goal Seek and know how to use it, that is also good here.↩︎

  14. In calculations with the negative binomial distribution, R uses gamma functions rather than factorials, so the n parameter doesn’t have to be an integer.↩︎

  15. For those of us old enough to remember times before calculators (which I am, just), this is how we would do division if we couldn’t do it by long division. We used to have books called “log tables” in which you could look up base-10 logs of anything. Look up the log of the thing on the top of the division, look up the log of the thing on the bottom, subtract, and then turn to the “antilog” tables and look up the result there to find the answer. exp is playing the role of antilog here. Example: to work out \(4/3\), look up the (base 10) log of 4, which is 0.602, and the log of 3, which is 0.477. Subtract to get 0.125. I happen to remember that the base 10 log of 1.3 is 0.114, so \(4/3\) is a bit bigger than 1.3, as it is. An antilog table would give the answer more precisely.↩︎

  16. This is what I meant earlier when I said that with logs, percent changes are the ones of interest.↩︎