---
title: "Normal quantile plots"
---
## The normal quantile plot
- see that normal distributions of data (or being normal enough)
important
- only tools we have to assess this are histograms and maybe boxplots
- a better tool is **normal quantile plot**:
- plot data against what you expect if data actually normal
- look for points to follow a straight line, at least approx
- `ggplot` code: `aes` `sample`; geoms `stat_qq` and `stat_qq_line`
## Packages
The usual:
```{r}
library(tidyverse)
```
## Kids learning to read
```{r inference-4a-R-1, echo=FALSE, message=FALSE}
my_url <- "http://ritsokiguess.site/datafiles/drp.txt"
kids <- read_delim(my_url," ")
glimpse(kids)
```
```{r inference-4a-R-2}
ggplot(kids, aes(x = group, y = score)) + geom_boxplot()
```
Each group looks more or less normal, or at least close to symmetric.
## Get the groups separately
```{r inference-4a-R-3}
kids %>% filter(group == "t") -> treatment
kids %>% filter(group == "c") -> control
```
to check
```{r inference-4a-R-4}
treatment %>% count(group)
control %>% count(group)
```
## The treatment group
```{r inference-4a-R-5, fig.height=4.5}
ggplot(treatment, aes(sample = score)) +
stat_qq() + stat_qq_line()
```
only problem here is lowest value a little too low (mild outlier).
## Control group
```{r inference-4a-R-6, fig.height=4}
ggplot(control, aes(sample = score)) +
stat_qq() + stat_qq_line()
```
This time, highest value a little too high, but again, no real problem
with normality.
## Facetting more than one sample
Use the whole data set and facet by groups
```{r inference-4a-R-7, fig.height=4.5}
ggplot(kids, aes(sample = score)) +
stat_qq() + stat_qq_line() + facet_wrap(~group)
```
## Blue Jays attendances, skewed to right
```{r inference-4a-R-8, echo=FALSE, message=FALSE}
jays <- read_csv("jays15-home.csv")
```
```{r inference-4a-R-9}
ggplot(jays, aes(x = attendance)) + geom_histogram(bins = 6)
```
## On a normal quantile plot
```{r inference-4a-R-10, fig.height=3.5}
ggplot(jays, aes(sample = attendance)) +
stat_qq() + stat_qq_line()
```
- Attendances at low end too bunched up: skewed to right.
- Right-skewness can also show up as highest values being too high, or
as a curved pattern in the points.
## More normal quantile plots
- How straight does a normal quantile plot have to be?
- There is randomness in real data, so even a normal quantile plot
from normal data won't look perfectly straight.
- With a small sample, can look not very straight even from normal
data.
- Looking for systematic departure from a straight line; random
wiggles ought not to concern us.
- Look at some examples where we know the answer, so that we can see
what to expect.
## Normal data, large sample
```{r set-seed, echo=F}
set.seed(457299)
```
```{r inference-4a-R-11, fig.height=4.5}
d <- tibble(x=rnorm(200))
ggplot(d, aes(x=x)) + geom_histogram(bins=10)
```
## The normal quantile plot
```{r inference-4a-R-12, fig.height=4.5}
ggplot(d,aes(sample=x))+stat_qq()+stat_qq_line()
```
## Normal data, small sample
```{r inference-4a-R-13, echo=F}
set.seed(457299)
```
- Not so convincingly normal, but not obviously skewed:
```{r normal-small, fig.height=4.5}
d <- tibble(x=rnorm(20))
ggplot(d, aes(x=x)) + geom_histogram(bins=5)
```
## The normal quantile plot
Good, apart from the highest and lowest points being slightly off. I'd
call this good:
```{r inference-4a-R-14, fig.height=4.5}
ggplot(d, aes(sample=x)) + stat_qq() + stat_qq_line()
```
## Chi-squared data, *df* = 10
Somewhat skewed to right:
```{r inference-4a-R-15, fig.height=4.5}
d <- tibble(x=rchisq(100, 10))
ggplot(d,aes(x=x)) + geom_histogram(bins=10)
```
## The normal quantile plot
Somewhat opening-up curve:
```{r inference-4a-R-16, fig.height=4.5}
ggplot(d,aes(sample=x))+stat_qq()+stat_qq_line()
```
## Chi-squared data, df = 3
Definitely skewed to right:
```{r chisq-small-df, fig.height=4.5}
d <- tibble(x=rchisq(100, 3))
ggplot(d, aes(x=x)) + geom_histogram(bins=10)
```
## The normal quantile plot
Clear upward-opening curve:
```{r inference-4a-R-17, fig.height=4.5}
ggplot(d,aes(sample=x))+stat_qq()+stat_qq_line()
```
## t-distributed data, df = 3
Long tails (or a very sharp peak):
```{r t-small, fig.height=4.5}
d <- tibble(x=rt(300, 3))
ggplot(d, aes(x=x)) + geom_histogram(bins=15)
```
## The normal quantile plot
Low values too low and high values too high for normal.
```{r inference-4a-R-18, fig.height=4.5}
ggplot(d,aes(sample=x))+stat_qq()+stat_qq_line()
```
## Summary
On a normal quantile plot:
- points following line (with some small wiggles): normal.
- kind of deviation from a straight line indicates kind of
nonnormality:
- a few highest point(s) too high and/or lowest too low: outliers
- else see how points at each end off the line:
| | High points | |
|----------------|-------------|--------------|
| **Low points** | **Too low** | **Too high** |
| **Too low** | Skewed left | Long tails |
| **Too high** | Short tails | Skewed right |
- short-tailed distribution OK for $t$ (mean still good), but others
problematic (depending on sample size).