Worksheet 4

Published

October 2, 2024

Questions are below. My solutions will be available after the tutorials are all finished. The whole point of these worksheets is for you to use your lecture notes to figure out what to do. In tutorial, the TAs are available to guide you if you get stuck. Once you have figured out how to do this worksheet, you will be prepared to tackle the assignment that depends on it.

If you are not able to finish in an hour, I encourage you to continue later with what you were unable to finish in tutorial.

Prison stress

Being in prison is stressful, for anybody. 26 prisoners took part in a study where their stress was measured at the start and end of the study. Some of the prisoners, chosen at random, completed a physical training program (for these prisoners, the Group column is Sport) and some did not (Group is Control). The researchers’ main aim was to see whether the physical training program reduced stress on average in the population of prisoners. The data are in http://www.ritsokiguess.site/datafiles/PrisonStress.csv, in four columns, respectively an identifier for the prisoner, whether or not they did physical training, their stress score at the start of the study, and their stress score at the end.

  1. Read in and display (some of) the data.

Very much the usual. Give the dataframe a name of your choosing; stress is good as a name because none of the columns are actually called stress:

my_url <- "http://www.ritsokiguess.site/datafiles/PrisonStress.csv"
stress <- read_csv(my_url)
Rows: 26 Columns: 4
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (2): Subject, Group
dbl (2): PSSbefore, PSSafter

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
stress

As you see, the columns of stress scores are called PSSbefore and PSSafter, which is because PSS was the name of the stress scale the researchers used.

  2. Make a suitable graph of the stress scores at the end of the study and whether or not each prisoner was in the Sport group.

These are the columns PSSafter (quantitative) and Group (categorical), so as is the way with these things, you need a boxplot:

ggplot(stress, aes(x = Group, y = PSSafter)) + geom_boxplot()

Unlike the one on the midterm, you don’t need to make the boxes go left and right (although no harm if you do). A vertical boxplot is fine here.

An alternative is to reason that you have one quantitative variable (PSSafter) and “too many” categorical variables (Group), so you do histograms facetted by Group:

ggplot(stress, aes(x = PSSafter)) + geom_histogram(bins = 5) +
  facet_wrap(~ Group, ncol = 1)

choosing a suitable number of bins (I think about 6 bins is as high as you want to go).

  3. Run the most appropriate \(t\)-test to compare the stress scores at the end of the study for the two groups of prisoners. Bear in mind what the researchers are trying to show. What do you conclude from your test, in the context of the data?

This means comparing PSSafter between the two groups defined in Group. The researchers wanted to show that the mean stress score is lower for the prisoners who did the physical training, so we need a one-sided test. The two groups in Group are Control and Sport; Control comes first alphabetically, so the test works with Control minus Sport, and the researchers expect that difference to be positive, which means alternative = "greater".

The other thing to consider is whether we should be doing a Welch or a pooled test. I’m prepared to entertain either of these, as long as you state a reason for doing the one you do. My take is that the two groups differ slightly in spread (the boxes on the boxplots differ slightly in height), so I would do the Welch test:

t.test(PSSafter ~ Group, data = stress, 
       alternative = "greater")

    Welch Two Sample t-test

data:  PSSafter by Group
t = 1.3361, df = 21.325, p-value = 0.09781
alternative hypothesis: true difference in means between group Control and group Sport is greater than 0
95 percent confidence interval:
 -1.069768       Inf
sample estimates:
mean in group Control   mean in group Sport 
             23.72727              20.00000 

This gives a P-value of 0.09781, which is not smaller than 0.05, so I cannot reject the null hypothesis, and so there is no evidence here that the physical training reduces stress on average in prisoners.
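If you want a number to back up the eyeballed comparison of spreads (an extra check, not something the question requires), you can compute the sample size and SD of the after scores in each group:

```r
# sample size and SD of the after scores, by group
stress %>% 
  group_by(Group) %>% 
  summarize(n = n(), sd_after = sd(PSSafter))
```

If the two SDs come out close, that supports a pooled test; if not, Welch. Either way, as you are about to see, it makes little difference here.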

I think you could also reasonably say that the two groups do not differ substantially in spread, on the basis that the boxes on the boxplot are not very different in height, and therefore that the pooled test would also work:

t.test(PSSafter ~ Group, data = stress, 
      var.equal = TRUE, alternative = "greater")

    Two Sample t-test

data:  PSSafter by Group
t = 1.3424, df = 24, p-value = 0.09601
alternative hypothesis: true difference in means between group Control and group Sport is greater than 0
95 percent confidence interval:
 -1.023091       Inf
sample estimates:
mean in group Control   mean in group Sport 
             23.72727              20.00000 

The P-value is almost identical, and the conclusion is the same, that there is no evidence that the physical training is effective in reducing stress.

The question to ask yourself, when looking over this afterwards, is therefore not “did I do the right test?”, but instead “did I do the test I did for a good reason?”.

  4. Make a suitable plot of the stress measurements before the study for each group of prisoners. How, if at all, does that impact the conclusion you drew in the previous part? Explain briefly.

This is another boxplot, which is not in itself very exciting, but what is interesting is the conclusion it helps us to draw:

ggplot(stress, aes(x = Group, y = PSSbefore)) +
  geom_boxplot()

This shows that the prisoners who did the physical training had quite a bit more stress at the start of the study than the control group. Compare that with your first boxplot, which said that after the study, the physical-training group had a slightly lower stress score than the control group.

Putting those two things together, the effect of the physical training was actually larger than our comparison of the after scores led us to believe. For example, you could read the medians before and after off the boxplots: the control group went up from 15 to 26, while the Sport group went down from 23 to 21. (Looking at these numbers, the implication seems to be that stress will increase over time, but putting the prisoners through a physical training program will at least keep it about the same.)
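Rather than reading the medians off the boxplots, you can compute them directly (the values you get should match what the boxplots show):

```r
# group medians of the stress scores, before and after
stress %>% 
  group_by(Group) %>% 
  summarize(med_before = median(PSSbefore),
            med_after = median(PSSafter))
```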

Extra: that’s as far as you needed to go, but it’s worth thinking about how you might take the before scores properly into account. One approach is to adjust for the before measurements via a regression (which leads on to the technique known as analysis of covariance). A simplified version of that is to look at the difference between after and before for each subject, which is asking “did the Sport subjects improve by more than the Control ones did?”. The idea of taking differences between two measurements on the same subject might remind you of matched pairs.1 This is an odd experimental design, though, because it has elements of both two independent samples (the subjects doing Sport vs. the Control ones) and matched pairs (two observations for each subject).

Anyway, let’s work out the differences as after minus before, and then worry about what kind of difference we would see if there was a training effect. We’ll start with a boxplot of the differences:

stress %>% 
  mutate(diff = PSSafter - PSSbefore) -> PrisonStress
ggplot(PrisonStress, aes(x = Group, y = diff)) + geom_boxplot()

This is starting to look like a significant difference, and in the right direction: the Control group’s stress has gone up a little between before and after, and the Sport group’s stress has come down, by something like 5 points in both cases.

So let’s compare the differences for the two groups, once again with a two-sample t-test. I would favour Welch here (those two spreads do look different), and I have no particular concerns with normality, given the sample sizes:

t.test(diff ~ Group, data = PrisonStress, alternative = "greater")

    Welch Two Sample t-test

data:  diff by Group
t = 3.5908, df = 15.461, p-value = 0.001282
alternative hypothesis: true difference in means between group Control and group Sport is greater than 0
95 percent confidence interval:
 5.792541      Inf
sample estimates:
mean in group Control   mean in group Sport 
             7.363636             -3.933333 

With a P-value of 0.0013, there is certainly a significant difference in the differences2 between the Sport and Control groups; that is to say, the effect of doing Sport is to reduce the stress from what it was before, in comparison to the Control group where stress actually went up. The differences in stress between the two groups are different on average.

So that’s what the researchers were looking for.

Another graph you might draw is based on thinking of the after stress value as a response and the before one as explanatory (both quantitative), with the treatment group as categorical. This suggests drawing a scatterplot with the points labelled by treatment group:

ggplot(PrisonStress, aes(x = PSSbefore, y = PSSafter, colour = Group)) +
  geom_point() + geom_smooth(method = "lm", se = FALSE)
`geom_smooth()` using formula = 'y ~ x'

There’s still a good bit of scatter, but most of the red (Control) points sit above most of the blue (Sport) points, particularly if you compare subjects with similar before scores. The lines help us judge that: the red line is above the blue one, meaning that for any given before score, the expected after score is higher for someone in the Control group than for someone in the Sport group. That is, once you allow for how much stress a subject had before, they will have less stress after doing the physical training than they would have had in the control group.

Analysis of covariance is based on fitting regression lines to the two separate groups (which is what the red and blue lines actually are). We don’t talk about regression until much later in the course, but the ideas are these:

stress.1 <- lm(PSSafter ~ PSSbefore + Group, data = PrisonStress)
summary(stress.1)

Call:
lm(formula = PSSafter ~ PSSbefore + Group, data = PrisonStress)

Residuals:
    Min      1Q  Median      3Q     Max 
-11.564  -2.759   0.139   3.569   9.309 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  16.0909     2.7358   5.882 5.39e-06 ***
PSSbefore     0.4667     0.1298   3.595  0.00153 ** 
GroupSport   -7.2598     2.4731  -2.935  0.00743 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 5.717 on 23 degrees of freedom
Multiple R-squared:  0.4044,    Adjusted R-squared:  0.3526 
F-statistic: 7.809 on 2 and 23 DF,  p-value: 0.00258

The significantly positive number beside PSSbefore is an ordinary regression slope.3 This says that if your stress score before is higher by 1, your stress score after is expected to be higher by 0.47 units on this scale. This is not terribly surprising, and also not really what we care about.

The significantly negative number next to GroupSport says that if you were in the physical training group, your stress score afterwards is on average lower by over 7 points, compared to if you were in the “baseline” Control group, even if your stress score before was the same. This is really quantifying the effect of the treatment while also allowing for, or adjusting for, other reasons the groups might have been different (such as the Sport group having higher stress before than the control group).

The two-sample t-test you did on the after stress scores did not account for how the groups might have been different before (it assumed they were the same up to randomization, which it seems they were not), while this last analysis (and the one based on the differences) both account for everything that is going on. I hope you are feeling that these more sophisticated analyses are rather more satisfactory than the two-sample t-test you did.

  5. Going back to your plot of the second part of this question (the boxplot of after scores against group), why might you be concerned about the Control group of prisoners for your \(t\)-test? Explain briefly (two reasons).

I can’t remember all the way back there, so I’ll draw the boxplot again:

ggplot(stress, aes(x = Group, y = PSSafter)) + geom_boxplot()

The Control group distribution appears to be skewed to the left (long lower whisker). That’s one reason. The second comes from looking at the sample size, as usual. You can either go back to your listing of the data (in the first part) and physically count how many observations are in the Control group, or (better) use the fact that you know how to count things in R:

stress %>% count(Group)

There are only 11 observations in the Control group, so the Central Limit Theorem will help us some, but maybe not very much. So we should have some concern about the validity of our two-sample \(t\)-test, on the basis that one of the groups doesn’t look normal enough given the smallness of the sample.

If you drew histograms, you ought to get to about the same place: make some comment about the non-normality of the Control group (on mine it looks like a low outlier, but depending on your number of bins, it might look like a long left tail) together with a comment on the sample size.

Extras:

  • We’ll see in a minute whether we should have been concerned.

  • The Sport group is, to my mind, close enough to normal given its slightly larger sample size and only slightly longer upper tail. My histogram looks bimodal, but only slightly so, and roughly symmetric in shape, so I would guess that the sample size of 15 will take care of that.

  6. Obtain a bootstrap sampling distribution of the sample mean for the PSSafter values in the Control group. From this distribution, do you think your \(t\)-test is reasonable? Explain briefly. (You may assume that we are happy with the distribution of PSSafter values in the Sport group.)

This is a lot like the two-sample one in the lecture notes: grab the values whose distribution you want to assess, and save them somewhere:

PrisonStress %>% filter(Group == "Control") -> controls

Then repeatedly take bootstrap samples from the PSSafter values in the dataframe you just made, and draw a plot such as a histogram:

tibble(sim = 1:1000) %>% 
  rowwise() %>% 
  mutate(my_sample = list(sample(controls$PSSafter, replace = TRUE))) %>% 
  mutate(my_mean = mean(my_sample)) %>% 
  ggplot(aes(x = my_mean)) + geom_histogram(bins = 10)

Your graph may look different from mine (a bit, not substantially) because randomness.

Then make a call about whether you think this looks normal. I don’t mind which way you call it; I care about your reason, so one of these:

  • This looks a little skewed to the left still, so we should not have done a \(t\)-test here.

  • This looks very close to normal, so the \(t\)-test is fine.

When this was an assignment question, what mattered was making a call with a good reason. Having drawn a normal quantile plot (see the Extra below), you could argue that the slight but clear pattern of left-skewness is enough to cause some worry, but you can also make the case that the points are close enough to the line, and so that we are close enough to normal.

There is, as you see, still a judgement call to make, but this one is easier than thinking about a boxplot and a sample size (there is only one thing to think about here: “is this plot normal or close to it?”).

Extra: A normal quantile plot gives us a more detailed picture. It doesn’t make the decision for us, but it gives us the material that we can use to make the decision, or that we can use to discuss the decision in our write-up.4 We haven’t seen this plot in lecture yet (I’m now realizing that I could stand to show it to you earlier in the course), but if you want a preview:

tibble(sim = 1:10000) %>% 
  rowwise() %>% 
  mutate(my_sample = list(sample(controls$PSSafter, replace = TRUE))) %>% 
  mutate(my_mean = mean(my_sample)) %>% 
  ggplot(aes(sample = my_mean)) + stat_qq() + stat_qq_line()

I reran the simulation with 10,000 bootstrap samples because my experience with this plot is that you can otherwise get led astray by the tails.

The idea is that if the points are on the line, the distribution is normal, and if they deviate from it in some systematic way, that tells you something about how the distribution fails to be normal. This one has a slight downward-opening curve, which means if anything the distribution is skewed to the left.5 Now you can go back to your histogram and see whether you see a little bit of left-skewedness there.

My take is that the distribution is not quite normal, but it is very close, and I would be happy to use a \(t\)-test here.

Home prices

A realtor kept track of the asking prices of 37 homes for sale in West Lafayette, Indiana, in a particular year. The asking prices are in http://ritsokiguess.site/datafiles/homes.csv. There are two columns, the asking price (in $) and the number of bedrooms that home has (either 3 or 4, in this dataset). The realtor was interested in whether the mean asking price for 4-bedroom homes was bigger than for 3-bedroom homes.

  1. Read in and display (some of) the data.

The exact usual. Your choice of name for the dataframe, as ever:

my_url <- "http://ritsokiguess.site/datafiles/homes.csv"
asking <- read_csv(my_url)
Rows: 37 Columns: 2
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
dbl (2): price, bdrms

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
asking

There are indeed 37 homes; the first few of them have 4 bdrms, and the ones further down, if you scroll, have 3. The price column does indeed look like asking prices of homes for sale.6

  2. Draw a suitable graph of these data. Hint: if you do the obvious thing, you’ll get a graph that makes no sense. What happened, and how can you fix it up? The warning you might get on your graph will give you a hint.

Two groups of prices to compare, or one quantitative column and one column that appears to be categorical (it’s actually a number, but it’s playing the role of a categorical or grouping variable). So a boxplot. This requires care, though; if you do it without thinking you’ll get this:

ggplot(asking, aes(x = bdrms, y = price)) + geom_boxplot()
Warning: Continuous x aesthetic
ℹ did you forget `aes(group = ...)`?

This makes no sense — there are supposed to be two groups of houses, and this plot has only one. What happened? The warning is the clue: the number of bedrooms looks quantitative (“continuous”) and so ggplot has tried (and failed) to treat it as such.

The perhaps most direct way around this is to take the warning message at face value and add bdrms as a group, thus:

ggplot(asking, aes(x = bdrms, y = price, group = bdrms)) + geom_boxplot()

and that works (as you will see below, it is effectively the same as the other methods that require a bit more thought).

You might be thinking that this is something like black magic, so I offer another idea where you have a fighting chance of understanding what is being done.

The problem is that bdrms looks like a quantitative variable (it has values 3 and 4 that are numbers), but we want it to be treated as a categorical variable. The easiest way to turn it into one is via factor, like this:

ggplot(asking, aes(x = factor(bdrms), y = price)) + geom_boxplot()

If the funny label on the \(x\)-axis bothers you, and it probably should,7 define a new variable first that is the factor version of bdrms. You can overwrite the old bdrms since we will not need the number as a number anywhere in this question:8

asking %>% 
  mutate(bdrms = factor(bdrms)) -> asking
ggplot(asking, aes(x = bdrms, y = price)) + geom_boxplot()

and that works smoothly.9

As a very quick extra: factor(bdrms) and group = bdrms both correctly give two boxplots side by side, but if you look carefully, the shaded grey area in the background of the graph is slightly different in each case. The group = way still treats bdrms as quantitative, and the \(x\)-axis reflects that (there is an axis “tick” at 3.5 bedrooms), but the factor(bdrms) plot treats the made-categorical bdrms as a genuine categorical variable with the values 3 and 4 and nothing else (the \(x\)-axis only has ticks at 3 and 4). From that point of view, the group = bdrms plot is a bit of a hack: it makes the boxplots come out right without fixing up the \(x\)-axis.

  3. Comment briefly on your plot. Does it suggest an answer to the realtor’s question? Do you have any doubts about the appropriateness of a \(t\)-test in this situation? Explain briefly. (Hint: your plot should have two groups. If it only has one, make sure you have asked a TA for help to get the right graph.)

It seems pretty clear that the average (on this plot, median) asking price for 4-bedroom houses is higher than for 3-bedroom houses. However, for a \(t\)-test to be appropriate, we need approximately normal distributions within each group of asking prices, and you can reasonably say that we do not: both distributions of asking prices are skewed to the right, and the 3-bedroom asking prices have three outliers at the top end.10

The other thing you need to consider is sample size: there are 37 houses altogether, so about 20 in each group:

asking %>% count(bdrms)

Thus the Central Limit Theorem will offer some help, but you could reasonably argue that even a sample size of 23 won’t be enough to fix up that skewness and those outliers in the 3-bedroom group.
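As an extra check on the skewness (not required for the question), you can compare the mean and median asking price within each group; a mean noticeably above the median is a sign of right-skewness or high outliers:

```r
# mean vs. median asking price by number of bedrooms
asking %>% 
  group_by(bdrms) %>% 
  summarize(mean_price = mean(price), median_price = median(price))
```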

  4. Sometimes prices work better on a log scale. This is because percent changes in prices are often of more interest than absolute dollar-value changes. Re-draw your plot using logs of asking prices. (In R, log() takes natural (base \(e\)) logs, which are fine here.) Do you like the shapes of the distributions better? Hint: you have a couple of options. One is to use the log right in your plotting (or, later, testing) functions. Another is to define a new column containing the log-prices and work with that.

You can put the log right in the ggplot command, thus:

ggplot(asking, aes(x = bdrms, y = log(price))) + geom_boxplot()

These look a lot better. The 4-bedroom distribution is close to symmetric and the 3-bedroom distribution is much less skewed (and has lost its outliers).

For this, and the sample sizes we have, I would now have no problem at all with a \(t\)-test.

The other way to do this is to make a new column that has the log-price in it:

asking %>% 
  mutate(log_price = log(price)) -> asking

and then make the plot:

ggplot(asking, aes(x = bdrms, y = log_price)) + geom_boxplot()

Both ways come out the same, and are equally good.

For the second way, it is better to save a dataframe with the log-prices in it and then make a plot, because we will be using the log-prices in our hypothesis test in a moment. If you use a pipeline here, like this:

asking %>% 
  mutate(log_price = log(price)) %>% 
  ggplot(aes(x = bdrms, y = log_price)) + geom_boxplot()

it works here, but you will then have to define the log-prices again below. If you don’t see that now, that’s OK; when you come to do the \(t\)-test with the log-prices in the next part, you ought to realize that calculating the log-prices a second time is inefficient, so you should come back here and save the dataframe with the log-prices in it. Or, I guess, use the log-prices directly in the \(t\)-test, but it seems odd to do one thing one way and the other thing a different way.

  5. Run a suitable \(t\)-test to compare the log-prices. What do you conclude? Hint: as for the graph in the previous part, you can use log directly in t.test, or use the new columns with the log-prices in them (if you did that).

Bear in mind what the realtor wants to know: whether the mean (log-) price is higher for 4-bedroom houses vs. 3-bedroom houses. This was something the realtor was curious about before they even looked at the data, so a one-sided test is appropriate. 3 is less than 4, so the alternative will be "less". Once again, you can put the log directly into the t.test, or use a column of log-prices that you create (such as the one you did for the boxplot, if you did that). Thus, two possibilities are:

t.test(log(price)~bdrms, data = asking, alternative = "less")

    Welch Two Sample t-test

data:  log(price) by bdrms
t = -5.1887, df = 30.59, p-value = 6.481e-06
alternative hypothesis: true difference in means between group 3 and group 4 is less than 0
95 percent confidence interval:
       -Inf -0.4139356
sample estimates:
mean in group 3 mean in group 4 
       11.82912        12.44410 

and (my new column was called log_price):

t.test(log_price ~ bdrms, data = asking, alternative = "less")

    Welch Two Sample t-test

data:  log_price by bdrms
t = -5.1887, df = 30.59, p-value = 6.481e-06
alternative hypothesis: true difference in means between group 3 and group 4 is less than 0
95 percent confidence interval:
       -Inf -0.4139356
sample estimates:
mean in group 3 mean in group 4 
       11.82912        12.44410 

The P-value and conclusion are the same either way. The P-value is 0.0000065, way less than 0.05, so there is no doubt that the mean (log-) asking price is higher for 4-bedroom homes than it is for 3-bedroom homes.

Side note: t.test is more forgiving than ggplot was with bdrms. Before, we had to wrap it in factor to get it treated as a categorical variable. It is reasonable enough to do that here as well (it works either way), and using factor(bdrms) shows that you are suspecting that there might be a problem again, which is intelligent. t.test, however, like other things from the early days of R,11 is more forgiving: it uses the distinct values of the variable on the right of the squiggle (bdrms) to make groups, whether they are text or numbers. Since the two-sample \(t\)-test is for comparing exactly two groups, it will complain if bdrms has more than two distinct values, but here we are good.

The other thing you should consider is whether we should have done a Welch or a pooled test. This one is, as you see, Welch, but a pooled test would be better if the two groups of log-prices had equal spreads. Go back and look at the last boxplot you did: on the log scale, the two spreads do actually look pretty similar.12 So we could also have done the pooled test. My guess (I haven’t looked at the results yet as I type this) is that the results will be almost identical in fact:

t.test(log_price ~ bdrms, data = asking, alternative = "less", var.equal = TRUE)

    Two Sample t-test

data:  log_price by bdrms
t = -5.0138, df = 35, p-value = 7.693e-06
alternative hypothesis: true difference in means between group 3 and group 4 is less than 0
95 percent confidence interval:
       -Inf -0.4077387
sample estimates:
mean in group 3 mean in group 4 
       11.82912        12.44410 

The test statistic and P-value are very close, and the conclusion is identical, so it didn’t matter which test you used. But the best answer will at least consider whether a pooled or a Welch test is the better one to use.

Extra: as I originally conceived this question, I was going to have you finish by finding a confidence interval to quantify how different the mean (log-) prices are. The problem with that here is that you get, if you re-do it two-sided, a confidence interval for the difference in mean log-prices, not an easy thing to interpret:

t.test(log_price ~ bdrms, data = asking)

    Welch Two Sample t-test

data:  log_price by bdrms
t = -5.1887, df = 30.59, p-value = 1.296e-05
alternative hypothesis: true difference in means between group 3 and group 4 is not equal to 0
95 percent confidence interval:
 -0.8568304 -0.3731165
sample estimates:
mean in group 3 mean in group 4 
       11.82912        12.44410 

Some thinking required: this is a difference of means of logs of prices. How can we say something about actual prices here? Let’s ignore the mean part for now; the scale these things are on is log-prices. What do we know about differences of logs? Haul out some math here:13

\[ \log a - \log b = \log(a/b), \] so

\[ \exp (\log a - \log b) = a/b.\]
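If you don’t trust the algebra, a quick numerical check of the identity in R:

```r
# exp of a difference of logs is the ratio of the original numbers
exp(log(6) - log(3))   # 6/3 = 2
```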

So how does this apply to our confidence interval? What it says is that if you take the confidence interval for the difference in means of log-prices, and exp its endpoints, what you get is a confidence interval for the ratio of means of the actual prices:

ci_log <- c(-0.8568304,-0.3731165)
exp(ci_log)
[1] 0.4245055 0.6885850

This says that the average asking price for the 3-bedroom houses is between 42 and 69 percent of the average asking price for the 4-bedroom houses; the 3-bedroom houses are quite a bit cheaper on average.14 So it is not at all surprising that the P-value was so small, whether you did pooled or Welch.
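The same back-transformation applied to the difference in sample mean log-prices (from the t.test output) gives a point estimate of that ratio, which, as it should, lands inside the interval:

```r
# difference in mean log-prices (3-bedroom minus 4-bedroom), back-transformed
exp(11.82912 - 12.44410)
```

This comes out at about 0.54: on average, the 3-bedroom houses are asking a bit over half of what the 4-bedroom houses are.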

Footnotes

  1. which we come back to later in the course.↩︎

  2. Yes, I know, but that’s what it is.↩︎

  3. The P-values are in the last column.↩︎

  4. The alternative test here, that we see later, is called Mood’s Median Test, but my suspicion is that this won’t be quite significant either. When the normality is not too bad, as here, you would expect the results from a \(t\)-test and Mood’s median test to be pretty similar.↩︎

  5. The rationale is that y on the plot is what we actually observed and x is what we would have expected to observe if the distribution were exactly normal. The low values we observed are a bit too low, and the high ones are not quite high enough to be normal, which means that the distribution is a bit too spread out at the low end and a bit too bunched up at the high end: that is, slightly skewed to the left.↩︎

  6. At least, for somewhere that is not Toronto!↩︎

  7. Note that the group idea I showed you first gives you a perfectly reasonable axis label.↩︎

  8. I am not asking for anything like the mean number of bedrooms anywhere here.↩︎

  9. Turning the number of bedrooms into text, via as.character(bdrms), also works. This is actually how we have been handling categorical variables so far: reading them in as text. Usually, though, they look like text. Here they don’t.↩︎

  10. I would be happy to call these genuine outliers, because there are only three of them, and they look a little separated from the whisker, so that it is reasonable to say that these three asking prices are bigger than the others.↩︎

  11. You might see cbind from base R in other courses; I use it because it is more forgiving than the tidyverse ways of gluing things together.↩︎

  12. The spread of actual asking prices without taking logs is bigger for the 4-bedroom houses, so if we had been willing to do a \(t\)-test without taking logs first, we should definitely have preferred the Welch test. The effect of taking logs is to bring the higher values down, compared to the lower ones, which made both distributions less right-skewed and also made the spreads more equal. For this reason, the log transformation is often useful: it can both equalize spread and make things look more normal, all at once.↩︎

  13. For those of us old enough to remember times before calculators (which I am, just), this is how we would do division if we couldn’t do it by long division. We used to have books called “log tables” in which you could look up base-10 logs of anything. Look up the log of the thing on the top of the division, look up the log of the thing on the bottom, subtract, and then turn to the “antilog” tables and look up the result there to find the answer. exp is playing the role of antilog here. Example: to work out \(4/3\), look up the (base 10) log of 4, which is 0.602, and the log of 3, which is 0.477. Subtract to get 0.125. I happen to remember that the base 10 log of 1.3 is 0.114, so \(4/3\) is a bit bigger than 1.3, as it is. An antilog table would give the answer more precisely.↩︎

  14. This is what I meant earlier when I said that with logs, percent changes are the ones of interest.↩︎