library(tidyverse)
12 Normal quantile plots
12.1 Lengths of heliconia flowers
The tropical flower Heliconia is fertilized by hummingbirds, a different species for each variety of Heliconia. Over time, the lengths of the flowers and the form of the hummingbirds’ beaks have evolved to match each other. The length of the Heliconia flower is therefore an important measurement. Does it have a normal distribution for each variety?
The data set at http://ritsokiguess.site/datafiles/heliconia.csv contains the lengths (in millimetres) of samples of flowers from each of three varieties of Heliconia: bihai, caribaea red, and caribaea yellow.
Read the data into R. There are different numbers of length measurements for each variety. How does this show up in the data frame? (Look at all the rows, not just the first ten.)
Make a normal quantile plot for the variety bihai.
Make a normal quantile plot for the variety Caribaea red (note that the variable name in the data frame has an underscore in it).
Make a normal quantile plot for the variety Caribaea yellow (this also has an underscore in it).
Which of the three varieties is closest to having a normal distribution? Explain (very) briefly.
For each of the two other varieties, apart from the one you mentioned in the last part, describe briefly how their distributions fail to be normal.
12.2 Ferritin and normality
In the lecture notes, we looked at some data on different athletes from the Australian Institute of Sport. This data set can be found at http://ritsokiguess.site/datafiles/ais.txt. Recall that the values are separated by tabs. In this question, we will assess one of the variables in the data set for normality.
Read the data set into R.
One of the variables,
Ferr
, is a measurement of Ferritin for each athlete. Obtain a normal quantile plot of the Ferritin values, for all the athletes together. What do you conclude about the shape of the distribution? Explain briefly.It is possible that the shape you found in the previous part is because the athletes from all the different sports were mixed together. Use
ggplot
to obtain one normal quantile plot for each sport, collected together on one plot.Looking at the plots in the previous part, would you say that the Ferritin values for each sport individually have a more normal shape than they do for all the sports together? Explain briefly.
My solutions follow:
12.3 Lengths of heliconia flowers
The tropical flower Heliconia is fertilized by hummingbirds, a different species for each variety of Heliconia. Over time, the lengths of the flowers and the form of the hummingbirds’ beaks have evolved to match each other. The length of the Heliconia flower is therefore an important measurement. Does it have a normal distribution for each variety?
The data set at http://ritsokiguess.site/datafiles/heliconia.csv contains the lengths (in millimetres) of samples of flowers from each of three varieties of Heliconia: bihai, caribaea red, and caribaea yellow.
- Read the data into R. There are different numbers of length measurements for each variety. How does this show up in the data frame? (Look at all the rows, not just the first ten.)
Solution
The usual read_csv
:
<- "http://ritsokiguess.site/datafiles/heliconia.csv"
my_url <- read_csv(my_url) heliconia
Rows: 23 Columns: 3
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
dbl (3): bihai, caribaea_red, caribaea_yellow
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
I suggested to look at all the rows. Here’s why:
heliconia
The varieties with fewer values have missings (NAs) attached to the end. This is because all the columns in a data frame have to have the same number of values. (The missings won’t impact what we do below — we get a warning but not an error, and the plots are the same as they would be without the missings — but you might be aesthetically offended by them, in which case you can read what I do later on.)
\(\blacksquare\)
- Make a normal quantile plot for the variety bihai.
Solution
There’s a certain amount of repetitiveness here (that we work around later):
ggplot(heliconia,aes(sample=bihai))+stat_qq()+stat_qq_line()
Warning: Removed 7 rows containing non-finite outside the scale range
(`stat_qq()`).
Warning: Removed 7 rows containing non-finite outside the scale range
(`stat_qq_line()`).
I’m saving the comments until we’ve seen all three, other than to say that the warning here is the missing values we found above.
\(\blacksquare\)
- Make a normal quantile plot for the variety Caribaea red (note that the variable name in the data frame has an underscore in it).
Solution
Same idea again:
ggplot(heliconia,aes(sample=caribaea_red))+stat_qq()+stat_qq_line()
\(\blacksquare\)
- Make a normal quantile plot for the variety Caribaea yellow (this also has an underscore in it).
Solution
And, one more time:
ggplot(heliconia,aes(sample=caribaea_yellow))+stat_qq()+stat_qq_line()
Warning: Removed 8 rows containing non-finite outside the scale range
(`stat_qq()`).
Warning: Removed 8 rows containing non-finite outside the scale range
(`stat_qq_line()`).
I did a lot of copying and pasting there. (The warning is missing values again.)
\(\blacksquare\)
- Which of the three varieties is closest to having a normal distribution? Explain (very) briefly.
Solution
Look at the three plots, and see which one stays closest to the line. To my mind, this is clearly the last one, Caribaea yellow. So your answer ought to be “Caribaea yellow, because the points are closest to the line”. This, I would say, is acceptably close to normal, so using a \(t\)-test here would be fine. The answer “the last one” is not quite complete, because I asked you which variety, so your answer needs to name a variety.
\(\blacksquare\)
- For each of the two other varieties, apart from the one you mentioned in the last part, describe briefly how their distributions fail to be normal.
Solution
Let’s look at bihai first. I see this one as an almost classic curve: the points are above the line, then below, then above again. If you look at the data scale (\(y\)-axis), the points are too bunched up to be normal at the bottom, and too spread out at the top: that is, skewed to the right. You might also (reasonably) take the view that the points at the bottom are close to the line (not sure about the very smallest one, though), but the points at the top are farther away, so that what we have here is two outliers at the top. I’m OK with that. It’s often difficult to distinguish between skewness and outliers (at the end of the long tail). What you conclude can often depend on how you look. We also had to look at the second plot, caribaea red. This is a rather strange one: the points veer away from the line at the ends, but look carefully: it is not outliers at both ends, but rather the points are too bunched up to be normal at both ends: that is, the distribution has short tails compared to the normal. It is something more like a uniform distribution, which has no tails at all, than a normal distribution, which won’t have outliers but it does have some kind of tails. So, “short tails”.
Extra: that’s all you needed, but I mentioned above that you might have been offended aesthetically by those missing values that were not really missing. Let’s see if we can do this aesthetically. As you might expect, it uses several of the tools from the “tidyverse”, some of which you will see later, but are getting an introduction to now.
First, tidy the data. The three columns of the data frame are all lengths, just lengths of different things, which need to be labelled. This is pivot_longer
from tidyr
:
%>%
heliconia pivot_longer(everything(), names_to="variety", values_to="length",
values_drop_na = TRUE) -> heliconia.long
heliconia.long
This is now aesthetic as well as tidy: all those NA
lines have gone (you can check that there are now \(16+23+15=54\) rows of actual data, as there should be). This was accomplished by the last thing in the pivot_longer
: “in the values (that is, the lengths), drop any missing values.”
Now, how to get a normal quantile plot for each variety? This is facet_wrap
on the end of the ggplot
again.
ggplot(heliconia.long, aes(sample=length)) +
stat_qq() + stat_qq_line() +
facet_wrap(~variety, scales="free")
These are a bit elongated vertically. The scales="free"
allows a different vertical scale for each plot (otherwise there would be one vertical scale for all three plots); I decided that was best here since the typical lengths for the three varieties are different. Caribaea yellow is more or less straight, bihai has outliers (and may also be curved), caribaea red has that peculiar S-bend shape.
I didn’t really like the vertical elongation. I’d rather have the plots be almost square, which they would be if we put them in three cells of a \(2 \times 2\) grid. facet_wrap
has nrow
and ncol
which you can use one or both of to make this happen. This creates an array of plots with two columns and as many rows as needed:
ggplot(heliconia.long, aes(sample=length)) +
stat_qq() + stat_qq_line() +
facet_wrap(~variety, scales="free", ncol=2)
I think the squarer plots make it easier to see the shape of these: curved, S-bend, straightish. Almost the same code will get a histogram for each variety, which I’ll also make squarish:
ggplot(heliconia.long, aes(x = length)) +
geom_histogram(bins = 5) + facet_wrap(~variety, scales = "free", ncol=2)
bihai has those two outliers, caribaea red has no tails to speak of (or you might say “it’s bimodal”, which would be another explanation of the pattern on the normal quantile plot1) and caribaea yellow is shoulder-shruggingly normal (I looked at that and said, “well, I guess it’s normal”.) After you’ve looked at the normal quantile plots, you see what a crude tool a histogram is for assessing normality.
\(\blacksquare\)
12.4 Ferritin and normality
In the lecture notes, we looked at some data on different athletes from the Australian Institute of Sport. This data set can be found at http://ritsokiguess.site/datafiles/ais.txt. Recall that the values are separated by tabs. In this question, we will assess one of the variables in the data set for normality.
- Read the data set into R.
Solution
read_tsv
is the right thing:
="http://ritsokiguess.site/datafiles/ais.txt"
my_url=read_tsv(my_url) athletes
Rows: 202 Columns: 13
── Column specification ────────────────────────────────────────────────────────
Delimiter: "\t"
chr (2): Sex, Sport
dbl (11): RCC, WCC, Hc, Hg, Ferr, BMI, SSF, %Bfat, LBM, Ht, Wt
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
athletes
I listed the data to check that I had it right, but I didn’t ask you to. (If you didn’t have it right, that will show up soon enough.)
\(\blacksquare\)
- One of the variables,
Ferr
, is a measurement of Ferritin for each athlete. Obtain a normal quantile plot of the Ferritin values, for all the athletes together. What do you conclude about the shape of the distribution? Explain briefly.
Solution
As you would expect:
ggplot(athletes, aes(sample=Ferr))+
stat_qq()+stat_qq_line()
This is almost a classic right skew: the values are too bunched up at the bottom and too spread out at the top. The curved shape should be making you think “skewed” and then you can work out which way it’s skewed.
\(\blacksquare\)
- It is possible that the shape you found in the previous part is because the athletes from all the different sports were mixed together. Use
ggplot
to obtain one normal quantile plot for each sport, collected together on one plot.
Solution
Your previous plot had all the sports mixed together. To that you add something that will put each sport in its own facet:
ggplot(athletes,aes(sample=Ferr))+stat_qq()+stat_qq_line()+
facet_wrap(~Sport)
\(\blacksquare\)
- Looking at the plots in the previous part, would you say that the Ferritin values for each sport individually have a more normal shape than they do for all the sports together? Explain briefly.
Solution
There are a couple of ways you can go, and as ever I’m looking mostly for consistency of argument. The two major directions you can go are (i) most of these plots are still curved the same way as the previous one, and (ii) they are mostly straighter than they were before. Possible lines of argument include that pretty much all of these plots are right-skewed still, with the same upward-opening curve. Pretty much the only one that doesn’t is Gymnastics, for which there are only four observations, so you can’t really tell. So, by this argument, Ferritin just does have a right-skewed distribution, and breaking things out by sport doesn’t make much difference to that. Or, you could go another way and say that the plot of all the data together was very curved, and these plots are much less curved, that is to say, much less skewed. Some of them, such as basketball and netball, are almost straight, and they are almost normally distributed. Some of the distributions, such as track sprinting (TSprnt
), are definitely still right-skewed, but not as seriously so as before. Decide what you think and then discuss how you see it.
\(\blacksquare\)
If you have studied a thing called kurtosis, the fourth moment about the mean, you’ll know that this measures both tail length and peakedness, so a short-tailed distribution also has a strong peak. Or, maybe, in this case, two strong peaks.↩︎