```
library(ggbiplot)
library(MASS)
library(tidyverse)
library(car)
```

# 35 Discriminant analysis

Packages for this chapter:

(Note: `ggbiplot` loads `plyr`, which overlaps a lot with `dplyr` (`filter`, `select` etc.). We want the `dplyr` stuff elsewhere, so we load `ggbiplot` *first*, and the things in `plyr` get hidden, as shown in the Conflicts. This, despite appearances, is what we want.)

## 35.1 Telling whether a banknote is real or counterfeit

* A Swiss bank collected a number of known counterfeit (fake) bills over time, and sampled a number of known genuine bills of the same denomination. Is it possible to tell, from measurements taken from a bill, whether it is genuine or not? We will explore that issue here. The variables measured were:

- length
- right-hand width
- left-hand width
- top margin
- bottom margin
- diagonal

Read in the data from link, and check that you have 200 rows and 7 columns altogether.

Run a multivariate analysis of variance. What do you conclude? Is it worth running a discriminant analysis? (This is the same procedure as with basic MANOVAs before.)

Run a discriminant analysis. Display the output.

How many linear discriminants did you get? Is that making sense? Explain briefly.

* Using your output from the discriminant analysis, describe how each of the linear discriminants that you got is related to your original variables. (This can, maybe even should, be done crudely: “does each variable feature in each linear discriminant: yes or no?”.)

What values of your variable(s) would make `LD1` large and positive?

* Find the means of each variable for each group (genuine and counterfeit bills). You can get this from your fitted linear discriminant object.

Plot your linear discriminant(s), however you like. Bear in mind that there is only one linear discriminant.

What kind of score on `LD1` do genuine bills typically have? What kind of score do counterfeit bills typically have? What characteristics of a bill, therefore, would you look at to determine if a bill is genuine or counterfeit?

## 35.2 Urine and obesity: what makes a difference?

A study was made of the characteristics of urine of young men. The men were classified into four groups based on their degree of obesity. (The groups are labelled `a, b, c, d`.) Four variables were measured, `x` (which you can ignore), pigment creatinine, chloride and chlorine. The data are in link as a `.csv` file. There are 45 men altogether.

Yes, you may have seen this one before. What you found was something like this, probably also with the Box M test (which has a P-value that is small, but not small enough to be a concern):

```
my_url <- "http://ritsokiguess.site/datafiles/urine.csv"
urine <- read_csv(my_url)
```

```
Rows: 45 Columns: 5
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (1): obesity
dbl (4): x, creatinine, chloride, chlorine
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
```

```
response <- with(urine, cbind(creatinine, chlorine, chloride))
urine.1 <- manova(response ~ obesity, data = urine)
summary(urine.1)
```

```
Df Pillai approx F num Df den Df Pr(>F)
obesity 3 0.43144 2.2956 9 123 0.02034 *
Residuals 41
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```

Our aim is to understand why this result was significant.

Read in the data again (copy the code from above) and obtain a discriminant analysis.

How many linear discriminants were you expecting? Explain briefly.

Why do you think we should pay attention to the first two linear discriminants but not the third? Explain briefly.

Plot the first two linear discriminant scores (against each other), with each obesity group being a different colour.

* Looking at your plot, discuss how (if at all) the discriminants separate the obesity groups. (Where does each obesity group fall on the plot?)

* Obtain a table showing observed and predicted obesity groups. Comment on the accuracy of the predictions.

Do your conclusions from (here) and (here) appear to be consistent?

## 35.3 Understanding a MANOVA

One use of discriminant analysis is to understand the results of a MANOVA. This question is a followup to a previous MANOVA that we did, the one with two variables `y1` and `y2` and three groups `a` through `c`. The data were in link.

Read the data in again and run the MANOVA that you did before.

Run a discriminant analysis “predicting” group from the two response variables. Display the output.

* In the output from the discriminant analysis, why are there exactly two linear discriminants `LD1` and `LD2`?

* From the output, how would you say that the first linear discriminant `LD1` compares in importance to the second one `LD2`: much more important, more important, equally important, less important, much less important? Explain briefly.

Obtain a plot of the discriminant scores.

Describe briefly how `LD1` and/or `LD2` separate the groups. Does your picture confirm the relative importance of `LD1` and `LD2` that you found back in part (here)? Explain briefly.

What makes group `a` have a low score on `LD1`? There are two steps that you need to make: consider the means of group `a` on variables `y1` and `y2` and how they compare to the other groups, and consider how `y1` and `y2` play into the score on `LD1`.

Obtain predictions for the group memberships of each observation, and make a table of the actual group memberships against the predicted ones. How many of the observations were wrongly classified?

## 35.4 What distinguishes people who do different jobs?

244^{1} people work at a certain company. They each have one of three jobs: customer service, mechanic, dispatcher. In the data set, these are labelled 1, 2 and 3 respectively. In addition, they each are rated on scales called `outdoor`, `social` and `conservative`. Do people with different jobs tend to have different scores on these scales, or, to put it another way, if you knew a person’s scores on `outdoor`, `social` and `conservative`, could you say something about what kind of job they were likely to hold? The data are in link.

Read in the data and display some of it.

Note the types of each of the variables, and create any new variables that you need to.

Run a multivariate analysis of variance to convince yourself that there are some differences in scale scores among the jobs.

Run a discriminant analysis and display the output.

Which is the more important, `LD1` or `LD2`? How much more important? Justify your answer briefly.

Describe what values for an individual on the scales will make each of `LD1` and `LD2` high.

The first group of employees, customer service, have the highest mean on `social` and the lowest mean on both of the other two scales. Would you expect the customer service employees to score high or low on `LD1`? What about `LD2`?

Plot your discriminant scores (which you will have to obtain first), and see if you were right about the customer service employees in terms of `LD1` and `LD2`. The job names are rather long, and there are a lot of individuals, so it is probably best to plot the scores as coloured circles with a legend saying which colour goes with which job (rather than labelling each individual with the job they have).

* Obtain predicted job allocations for each individual (based on their scores on the three scales), and tabulate the true jobs against the predicted jobs. How would you describe the quality of the classification? Is that in line with what the plot would suggest?

Consider an employee with these scores: 20 on `outdoor`, 17 on `social` and 8 on `conservative`. What job do you think they do, and how certain are you about that? Use `predict`, first making a data frame out of the values to predict for.

Since I am not making you hand this one in, I’m going to keep going. Re-run the analysis to incorporate cross-validation, and make a table of the predicted group memberships. Is it much different from the previous one you had? Why would that be?

## 35.5 Observing children with ADHD

A number of children with ADHD were observed by their mother or their father (only one parent observed each child). Each parent was asked to rate occurrences of behaviours of four different types, labelled `q1` through `q4` in the data set. Also recorded was the identity of the parent doing the observation for each child: 1 is father, 2 is mother.

Can we tell (without looking at the `parent` column) which parent is doing the observation? Research suggests that rating the degree of impairment in different categories depends on who is doing the rating: for example, mothers may feel that a child has difficulty sitting still, while fathers, who might do more observing of a child at play, might think of such a child as simply being “active” or “just being a kid”. The data are in link.

Read in the data and confirm that you have four ratings and a column labelling the parent who made each observation.

Run a suitable discriminant analysis and display the output.

Which behaviour item or items seem to be most helpful at distinguishing the parent making the observations? Explain briefly.

Obtain the predictions from the `lda`, and make a suitable plot of the discriminant scores, bearing in mind that you only have one `LD`. Do you think there will be any misclassifications? Explain briefly.

Obtain the predicted group memberships and make a table of actual vs. predicted. Were there any misclassifications? Explain briefly.

Re-run the discriminant analysis using cross-validation, and again obtain a table of actual and predicted parents. Is the pattern of misclassification different from before? Hints: (i) Bear in mind that there is no `predict` step this time, because the cross-validation output includes predictions; (ii) use a different name for the predictions this time because we are going to do a comparison in a moment.

Display the original data (that you read in from the data file) side by side with two sets of posterior probabilities: the ones that you obtained with `predict` before, and the ones from the cross-validated analysis. Comment briefly on whether the two sets of posterior probabilities are similar. Hints: (i) use `data.frame` rather than `cbind`, for reasons that I explain elsewhere; (ii) round the posterior probabilities to 3 decimals before you display them. There are only 29 rows, so look at them all. I am going to add the `LD1` scores to my output and sort by that, but you don’t need to. (This is for something I am going to add later.)

Row 17 of your (original) data frame above, row 5 of the output in the previous part, is the mother that was misclassified as a father. Why is it that the cross-validated posterior probabilities are 1 and 0, while the previous posterior probabilities are a bit less than 1 and a bit more than 0?

Find the parents where the cross-validated posterior probability of being a father is “non-trivial”: that is, not close to zero and not close to 1. (You will have to make a judgement about what “close to zero or 1” means for you.) What do these parents have in common, all of them or most of them?

## 35.6 Growing corn

A new type of corn seed has been developed. The people developing it want to know if the type of soil the seed is planted in has an impact on how well the seed performs, and if so, what kind of impact. Three outcome measures were used: the yield of corn produced (from a fixed amount of seed), the amount of water needed, and the amount of herbicide needed. The data are in link. 32 fields were planted with the seed, 8 fields with each soil type.

Read in the data and verify that you have 32 observations with the correct variables.

Run a multivariate analysis of variance to see whether the type of soil has any effect on any of the variables. What do you conclude from it?

Run a discriminant analysis on these data, “predicting” soil type from the three response variables. Display the results.

* Which linear discriminants seem to be worth paying attention to? Why did you get three linear discriminants? Explain briefly.

Which response variables do the important linear discriminants depend on? Answer this by extracting something from your discriminant analysis output.

Obtain predictions for the discriminant analysis. (You don’t need to do anything with them yet.)

Plot the first two discriminant scores against each other, coloured by soil type. You’ll have to start by making a data frame containing what you need.

On your plot that you just made, explain briefly how `LD1` distinguishes at least one of the soil types.

On your plot, does `LD2` appear to do anything to separate the groups? Is this surprising given your earlier findings? Explain briefly.

Make a table of actual and predicted `soil` group. Which soil type was classified correctly the most often?

## 35.7 Understanding athletes’ height, weight, sport and gender

On a previous assignment, we used MANOVA on the athletes data to demonstrate that there was a significant relationship between the combination of the athletes’ height and weight, with the sport they play and the athlete’s gender. The problem with MANOVA is that it doesn’t give any information about the *kind* of relationship. To understand that, we need to do discriminant analysis, which is the purpose of this question.

The data can be found at link.

Once again, read in and display (some of) the data, bearing in mind that the data values are separated by *tabs*. (This ought to be a free two marks.)

Use `unite` to make a new column in your data frame which contains the sport-gender *combination*. Display it. (You might like to display only a few columns so that it is clear that you did the right thing.) Hint: you’ve seen `unite` in the peanuts example in class.

Run a discriminant analysis “predicting” sport-gender combo from height and weight. Display the results. (No comment needed yet.)

What kind of height and weight would make an athlete have a large (positive) score on `LD1`? Explain briefly.

Make a guess at the sport-gender combination that has the *highest* score on LD1. Why did you choose the combination you did?

* What combination of height and weight would make an athlete have a *small* (that is, very negative) score on LD2? Explain briefly.

Obtain predictions for the discriminant analysis, and use these to make a plot of `LD1` score against `LD2` score, with the individual athletes distinguished by what sport they play and gender they are. (You can use colour to distinguish them, or you can use shapes. If you want to go the latter way, there are clues in my solutions to the MANOVA question about these athletes.)

Look on your graph for the four athletes with the smallest (most negative) scores on `LD2`. What do they have in common? Does this make sense, given your answer to part (here)? Explain briefly.

Obtain a (very large) square table, or a (very long) table with frequencies, of actual and predicted sport-gender combinations. You will probably have to make the square table very small to fit it on the page. For that, displaying the columns in two or more sets is OK (for example, six columns and all the rows, six more columns and all the rows, then the last five columns for all the rows). Are there any sport-gender combinations that seem relatively easy to classify correctly? Explain briefly.

My solutions follow:

## 35.8 Telling whether a banknote is real or counterfeit

* A Swiss bank collected a number of known counterfeit (fake) bills over time, and sampled a number of known genuine bills of the same denomination. Is it possible to tell, from measurements taken from a bill, whether it is genuine or not? We will explore that issue here. The variables measured were:

- length
- right-hand width
- left-hand width
- top margin
- bottom margin
- diagonal

- Read in the data from link, and check that you have 200 rows and 7 columns altogether.

Solution

Check the data file first. It’s aligned in columns, thus:

```
my_url <- "http://ritsokiguess.site/datafiles/swiss1.txt"
swiss <- read_table(my_url)
```

```
── Column specification ────────────────────────────────────────────────────────
cols(
length = col_double(),
left = col_double(),
right = col_double(),
bottom = col_double(),
top = col_double(),
diag = col_double(),
status = col_character()
)
```

`swiss`

Yep, 200 rows and 7 columns.

\(\blacksquare\)

- Run a multivariate analysis of variance. What do you conclude? Is it worth running a discriminant analysis? (This is the same procedure as with basic MANOVAs before.)

Solution

Small-m `manova` will do here:

```
response <- with(swiss, cbind(length, left, right, bottom, top, diag))
swiss.1 <- manova(response ~ status, data = swiss)
summary(swiss.1)
```

```
Df Pillai approx F num Df den Df Pr(>F)
status 1 0.92415 391.92 6 193 < 2.2e-16 ***
Residuals 198
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```

`summary(BoxM(response, swiss$status))`

```
Box's M Test
Chi-Squared Value = 121.8991 , df = 21 and p-value: 3.2e-16
```

That is a very significant Box’s M test, which means that we shouldn’t trust the MANOVA at all. It is only the fact that the MANOVA is *so* significant that provides any evidence that the discriminant analysis is worth doing.

Extra: you might be wondering whether you had to go to all that trouble to make the response variable. Would this work?

```
response2 <- swiss %>% select(length:diag)
swiss.1a <- manova(response2 ~ status, data = swiss)
```

`Error in model.frame.default(formula = response2 ~ status, data = swiss, : invalid type (list) for variable 'response2'`

No, because `response2` needs to be an R `matrix`, and it isn’t:

`class(response2)`

`[1] "tbl_df" "tbl" "data.frame"`

The error message was a bit cryptic (nothing unusual there), but a data frame (to R) is a special kind of `list`, so that R didn’t like `response2` being a data frame, which it thought was a list.
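That “a data frame is a list underneath” fact is easy to check for yourself. Here is a tiny self-contained sketch (using a made-up data frame, not the banknote data):

```r
# A data frame is a special kind of list (one element per column),
# which is why manova() complains about an "invalid type (list)".
df <- data.frame(x = 1:3, y = 4:6)
is.list(df)      # TRUE: underneath, a data frame really is a list
is.matrix(df)    # FALSE: not yet what manova() wants on the left
m <- as.matrix(df)
is.matrix(m)     # TRUE: this is an acceptable multivariate response
```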

This, however, works, since it turns the data frame into a matrix:

```
response4 <- swiss %>% select(length:diag) %>% as.matrix()
swiss.2a <- manova(response4 ~ status, data = swiss)
summary(swiss.2a)
```

```
Df Pillai approx F num Df den Df Pr(>F)
status 1 0.92415 391.92 6 193 < 2.2e-16 ***
Residuals 198
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```

Anyway, the conclusion: the status of a bill (genuine or counterfeit) definitely has an influence on some or all of those other variables, since the P-value \(2.2 \times 10^{-16}\) (or less) is really small. So it is apparently worth running a discriminant analysis to figure out where the differences lie.

As a piece of strategy, for creating the response matrix, you can always either use `cbind`, which creates a `matrix` directly, or you can use `select`, which is often easier but creates a data frame, and then turn *that* into a `matrix` using `as.matrix`. As long as you end up with a `matrix`, it’s all good.

\(\blacksquare\)

- Run a discriminant analysis. Display the output.

Solution

Now we forget about all that `response` stuff. For a discriminant analysis, the grouping variable (or combination of the grouping variables) is the “response”, and the quantitative ones are “explanatory”:

```
swiss.3 <- lda(status ~ length + left + right + bottom + top + diag, data = swiss)
swiss.3
```

```
Call:
lda(status ~ length + left + right + bottom + top + diag, data = swiss)
Prior probabilities of groups:
counterfeit genuine
0.5 0.5
Group means:
length left right bottom top diag
counterfeit 214.823 130.300 130.193 10.530 11.133 139.450
genuine 214.969 129.943 129.720 8.305 10.168 141.517
Coefficients of linear discriminants:
LD1
length 0.005011113
left 0.832432523
right -0.848993093
bottom -1.117335597
top -1.178884468
diag 1.556520967
```

\(\blacksquare\)

- How many linear discriminants did you get? Is that making sense? Explain briefly.

Solution

I got one discriminant, which makes sense because there are two groups, and the smaller of 6 (variables, not counting the grouping one) and \(2-1\) is 1.
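The same counting rule, min(number of variables, number of groups minus 1), can be checked on any dataset. A quick sketch using R’s built-in `iris` data (my choice of example, not part of this question), which has 4 quantitative variables and 3 groups, so min(4, 3 − 1) = 2 discriminants:

```r
library(MASS)
# iris: 4 measurements, 3 species, so we expect min(4, 3 - 1) = 2 LDs
iris.lda <- lda(Species ~ ., data = iris)
ncol(iris.lda$scaling)   # number of linear discriminants
```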

\(\blacksquare\)

- * Using your output from the discriminant analysis, describe how each of the linear discriminants that you got is related to your original variables. (This can, maybe even should, be done crudely: “does each variable feature in each linear discriminant: yes or no?”.)

Solution

This is the Coefficients of Linear Discriminants. Make a call about whether each of those coefficients is close to zero (small in size compared to the others), or definitely positive or definitely negative. These are judgement calls: either you can say that LD1 depends mainly on `diag` (treating the other coefficients as “small” or close to zero), or you can say that `LD1` depends on everything except `length`.

\(\blacksquare\)

- What values of your variable(s) would make `LD1` large and positive?

Solution

Depending on your answer to the previous part: If you said that only `diag` was important, `diag` being large would make `LD1` large and positive. If you said that everything but `length` was important, then it’s a bit more complicated: `left` and `diag` large, `right`, `bottom` and `top` small (since their coefficients are negative).

\(\blacksquare\)

- * Find the means of each variable for each group (genuine and counterfeit bills). You can get this from your fitted linear discriminant object.

Solution

`swiss.3$means`

```
length left right bottom top diag
counterfeit 214.823 130.300 130.193 10.530 11.133 139.450
genuine 214.969 129.943 129.720 8.305 10.168 141.517
```

\(\blacksquare\)

- Plot your linear discriminant(s), however you like. Bear in mind that there is only one linear discriminant.

Solution

With only one linear discriminant, we can plot `LD1` scores on the \(y\)-axis and the grouping variable on the \(x\)-axis. How you do that is up to you.

Before we start, though, we need the `LD1` scores. This means doing predictions. The discriminant scores are in there. We take the prediction output and make a data frame with all the things in the original data. My current preference (it changes) is to store the predictions, and then `cbind` them with the original data, thus:

```
swiss.pred <- predict(swiss.3)
d <- cbind(swiss, swiss.pred)
head(d)
```

I needed `head` because `cbind` makes an old-fashioned `data.frame` rather than a `tibble`, so if you display it, you get all of it.

This gives the LD1 scores, predicted groups, and posterior probabilities as well. That saves us having to pick out the other things later. The obvious thing is a boxplot. By examining `d` above (didn’t you?), you saw that the LD scores were in a column called `LD1`:

`ggplot(d, aes(x = status, y = LD1)) + geom_boxplot()`

This shows that positive LD1 scores go (almost without exception) with genuine bills, and negative ones with counterfeit bills. It also shows that there are three outlier bills, two counterfeit ones with unusually high LD1 score, and one genuine one with unusually *low* LD1 score, at least for a genuine bill.

This goes to show that (the Box M test notwithstanding) the two types of bill really are different in a way that is worth investigating.

Or you could do faceted histograms of `LD1` by `status`:

`ggplot(d, aes(x = LD1)) + geom_histogram(bins = 10) + facet_grid(status ~ .)`

This shows much the same thing as `plot(swiss.3)` does (try it).

\(\blacksquare\)

- What kind of score on `LD1` do genuine bills typically have? What kind of score do counterfeit bills typically have? What characteristics of a bill, therefore, would you look at to determine if a bill is genuine or counterfeit?

Solution

The genuine bills almost all have a *positive* score on LD1, while the counterfeit ones all have a *negative* one. This means that the genuine bills (depending on your answer to (here)) have a large `diag`, or they have a large `left` and `diag`, and a small `right`, `bottom` and `top`. If you look at your table of means in (here), you’ll see that the genuine bills do indeed have a large `diag`, or, depending on your earlier answer, a small `right`, `bottom` and `top`, but not actually a small `left` (the `left` values are very close for the genuine and counterfeit bills).

Extra: as to that last point, this is easy enough to think about. A boxplot seems a nice way to display it:

`ggplot(d, aes(y = left, x = status)) + geom_boxplot()`

There is a fair bit of overlap: the median is higher for the counterfeit bills, but the highest value actually belongs to a genuine one.

Compare that to `diag`:

`ggplot(d, aes(y = diag, x = status)) + geom_boxplot()`

Here, there is an almost complete separation of the genuine and counterfeit bills, with just one low outlier amongst the genuine bills spoiling the pattern. I didn’t look at the predictions (beyond the discriminant scores), since this question (as set on an assignment a couple of years ago) was already too long, but there is no difficulty in doing so. Everything is in the data frame I called `d`:

`with(d, table(obs = status, pred = class))`

```
pred
obs counterfeit genuine
counterfeit 100 0
genuine 1 99
```

(this labels the rows and columns, which is not necessary but is nice.)

The `tidyverse` way is to make a data frame out of the actual and predicted statuses, and then `count` what’s in there:

`d %>% count(status, class)`

This gives a “long” table, with frequencies for each of the combinations for which anything was observed.

Frequency tables are usually wide, and we can make this one so by pivot-wider-ing `class`:

```
d %>%
  count(status, class) %>%
  pivot_wider(names_from = class, values_from = n)
```

One of the genuine bills is incorrectly classified as a counterfeit one (evidently that low outlier on LD1), but every single one of the counterfeit bills is classified correctly. That missing value is actually a frequency that is zero, which you can fix up thus:

```
d %>%
  count(status, class) %>%
  pivot_wider(names_from = class, values_from = n, values_fill = 0)
```

which turns any missing values into the zeroes they should be in this kind of problem. It would be interesting to see what the posterior probabilities look like for that misclassified bill:

`d %>% filter(status != class)`

On the basis of the six measured variables, this looks a lot more like a counterfeit bill than a genuine one. Are there any other bills where there is any doubt? One way to find out is to find the maximum of the two posterior probabilities. If this is small, there is some doubt about whether the bill is real or fake. 0.99 seems like a very stringent cutoff, but let’s try it and see:

```
d %>%
  mutate(max.post = pmax(posterior.counterfeit, posterior.genuine)) %>%
  filter(max.post < 0.99) %>%
  dplyr::select(-c(length:diag))
```

The only one is the bill that was misclassified: it was actually genuine, but was classified as counterfeit. The posterior probabilities say that it was pretty unlikely to be genuine, but it was the only bill for which there was any noticeable doubt at all.

I had to use `pmax` rather than `max` there, because I wanted `max.post` to contain the larger of the two corresponding entries: that is, the first entry in `max.post` is the larger of the first entry of `counterfeit` and the first entry in `genuine`. If I used `max` instead, I’d get the largest of *all* the entries in `counterfeit` and *all* the entries in `genuine`, repeated 200 times. (Try it and see.) `pmax` stands for “parallel maximum”, that is, for each row separately. This also should work:

```
d %>%
  rowwise() %>%
  mutate(max.post = max(posterior.counterfeit, posterior.genuine)) %>%
  filter(max.post < 0.99) %>%
  select(-c(length:diag))
```

Because we’re using `rowwise`, `max` is applied to the pairs of values of `posterior.counterfeit` and `posterior.genuine`, *taken one row at a time.*

\(\blacksquare\)

## 35.9 Urine and obesity: what makes a difference?

A study was made of the characteristics of urine of young men. The men were classified into four groups based on their degree of obesity. (The groups are labelled `a, b, c, d`.) Four variables were measured, `x` (which you can ignore), pigment creatinine, chloride and chlorine. The data are in link as a `.csv` file. There are 45 men altogether.

Yes, you may have seen this one before. What you found was something like this:

```
my_url <- "http://ritsokiguess.site/datafiles/urine.csv"
urine <- read_csv(my_url)
```

```
Rows: 45 Columns: 5
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (1): obesity
dbl (4): x, creatinine, chloride, chlorine
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
```

```
response <- with(urine, cbind(creatinine, chlorine, chloride))
urine.1 <- manova(response ~ obesity, data = urine)
summary(urine.1)
```

```
Df Pillai approx F num Df den Df Pr(>F)
obesity 3 0.43144 2.2956 9 123 0.02034 *
Residuals 41
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```

`summary(BoxM(response, urine$obesity))`

```
Box's M Test
Chi-Squared Value = 30.8322 , df = 18 and p-value: 0.0301
```

Our aim is to understand why this result was significant. (Remember that the P-value on Box’s M test is not small enough to be worried about.)

- Read in the data again (copy the code from above) and obtain a discriminant analysis.

Solution

As above, plus:

```
urine.1 <- lda(obesity ~ creatinine + chlorine + chloride, data = urine)
urine.1
```

```
Call:
lda(obesity ~ creatinine + chlorine + chloride, data = urine)
Prior probabilities of groups:
a b c d
0.2666667 0.3111111 0.2444444 0.1777778
Group means:
creatinine chlorine chloride
a 15.89167 5.275000 6.012500
b 17.82143 7.450000 5.214286
c 16.34545 8.272727 5.372727
d 11.91250 9.675000 3.981250
Coefficients of linear discriminants:
LD1 LD2 LD3
creatinine 0.24429462 -0.1700525 -0.02623962
chlorine -0.02167823 -0.1353051 0.11524045
chloride 0.23805588 0.3590364 0.30564592
Proportion of trace:
LD1 LD2 LD3
0.7476 0.2430 0.0093
```

\(\blacksquare\)

- How many linear discriminants were you expecting? Explain briefly.

Solution

There are 3 variables and 4 groups, so the smaller of 3 and \(4-1=3\): that is, 3.

\(\blacksquare\)

- Why do you think we should pay attention to the first two linear discriminants but not the third? Explain briefly.

Solution

The first two “proportion of trace” values are a lot bigger than the third (or, the third one is close to 0).

\(\blacksquare\)

- Plot the first two linear discriminant scores (against each other), with each obesity group being a different colour.

Solution

First obtain the predictions, and then make a data frame out of the original data and the predictions.

```
urine.pred <- predict(urine.1)
d <- cbind(urine, urine.pred)
head(d)
```

`urine` produced the first five columns and `urine.pred` produced the rest.

To go a more tidyverse way, we can combine the original data frame and the predictions using `bind_cols`, but we have to be more careful that the things we are gluing together are both data frames:

`class(urine)`

`[1] "spec_tbl_df" "tbl_df" "tbl" "data.frame" `

`class(urine.pred)`

`[1] "list"`

`urine` is a `tibble` all right, but `urine.pred` is a `list`. What does it look like?

`glimpse(urine.pred)`

```
List of 3
$ class : Factor w/ 4 levels "a","b","c","d": 2 1 2 2 1 1 3 3 1 1 ...
$ posterior: num [1:45, 1:4] 0.233 0.36 0.227 0.294 0.477 ...
..- attr(*, "dimnames")=List of 2
.. ..$ : chr [1:45] "1" "2" "3" "4" ...
.. ..$ : chr [1:4] "a" "b" "c" "d"
$ x : num [1:45, 1:3] 0.393 -0.482 0.975 2.188 2.018 ...
..- attr(*, "dimnames")=List of 2
.. ..$ : chr [1:45] "1" "2" "3" "4" ...
.. ..$ : chr [1:3] "LD1" "LD2" "LD3"
```

A data frame is a list for which all the items are the same length, but some of the things in here are matrices. You can tell because they have a number of rows, 45, *and* a number of columns, 3 or 4. They *do* have the right number of rows, though, so something like `as.data.frame` (a base R function) will smoosh them all into one data frame, grabbing the columns from the matrices:

`head(as.data.frame(urine.pred))`

You see that the columns that came from matrices have gained two-part names, the first part from the name of the matrix, the second part from the column name within that matrix. Then we can do this:

```
dd <- bind_cols(urine, as.data.frame(urine.pred))
dd
```
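To see how `as.data.frame` builds those two-part names, here is a minimal standalone sketch with made-up data: a list holding a vector and a matrix, shaped like the output of `predict`:

```r
# A list whose components all have the same number of rows:
# a character vector plus a matrix with named columns
pred <- list(
  class = c("a", "b", "a"),
  x = matrix(1:6, nrow = 3, dimnames = list(NULL, c("LD1", "LD2")))
)
d0 <- as.data.frame(pred)
names(d0)  # "class" "x.LD1" "x.LD2": matrix name, dot, column name
```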

If you want to avoid base R altogether, though, and go straight to `bind_cols`, you have to be more careful about the types of things. `bind_cols` works *only* with vectors and data frames, not matrices, so it is up to you to make sure that is what you have. That means pulling out the pieces, turning them from matrices into data frames, and then gluing everything back together:

```
post <- as_tibble(urine.pred$posterior)
ld <- as_tibble(urine.pred$x)
ddd <- bind_cols(urine, class = urine.pred$class, ld, post)
ddd
```

That’s a lot of work, but you might say that it’s worth it because you are now absolutely sure what kind of thing everything is. I also had to be slightly careful with the vector of `class` values; in `ddd` it has to have a name, so I have to make sure I give it one.^{2} Any of these ways (in general) is good. The last way is a more careful approach, since you are making sure things are of the right type rather than relying on R to convert them for you, but I don’t mind which way you go. Now make the plot, making sure that you are using columns with the right names. I’m using my first data frame, with the two-part names:

`ggplot(d, aes(x = x.LD1, y = x.LD2, colour = obesity)) + geom_point()`

\(\blacksquare\)

- * Looking at your plot, discuss how (if at all) the discriminants separate the obesity groups. (Where does each obesity group fall on the plot?)

Solution

My immediate reaction was “they don’t much”. If you look a bit more closely, the `b` group, in green, is on the right (high `LD1`) and the `d` group (purple) is on the left (low `LD1`). The `a` group, red, is mostly at the top (high `LD2`) but the `c` group, blue, really is all over the place.

The way to tackle interpreting a plot like this is to look for each group individually and see if that group is only or mainly found on a certain part of the plot.

This can be rationalized by looking at the “coefficients of linear discriminants” on the output. `LD1` is low if creatinine and chloride are low (it has nothing much to do with `chlorine` since that coefficient is near zero). Group `d` is lowest on both creatinine and chloride, so that will be lowest on `LD1`. `LD2` is high if `chloride` is high, or `creatinine` and `chlorine` are low. Out of the groups `a, b, c`, `a` has the highest mean on chloride and lowest means on the other two variables, so this should be highest on `LD2` and (usually) is.

Looking at the means is only part of the story; if the individuals within a group are very variable, as they are here (especially group `c`), then that group will appear all over the plot. The table of means only says how the *average* individual within a group stacks up.

`ggbiplot(urine.1, groups = urine$obesity)`

This shows (in a way that is perhaps easier to see) how the linear discriminants are related to the original variables, and thus how the groups differ in terms of the original variables.^{3} Most of the B’s are high creatinine and high chloride (on the right); most of the D’s are low on both (on the left). LD2 has a bit of `chloride`, but not much of anything else. Extra: the way we used to do this was with “base graphics”, which involved plotting the `lda` output itself:

`plot(urine.1)`

which is a plot of each discriminant score against each other one. You can plot just the first two, like this:

`plot(urine.1, dimen = 2)`

This is easier than using `ggplot`, but (i) less flexible and (ii) you have to figure out how it works rather than doing things the standard `ggplot` way. So I went with constructing a data frame from the predictions, and then `ggplot`ting that. It’s a matter of taste which way is better.

\(\blacksquare\)

- * Obtain a table showing observed and predicted obesity groups. Comment on the accuracy of the predictions.

Solution

Make a table, one way or another:

```
tab <- with(d, table(obesity, class))
tab
```

```
class
obesity a b c d
a 7 3 2 0
b 2 9 2 1
c 3 4 1 3
d 2 0 1 5
```

`class` is always the *predicted* group in these. You can also name things in `table`. Or, if you prefer (equally good), the `tidyverse` way of counting all the combinations of true `obesity` and predicted `class`, which can be done all in one go, or in two steps by saving the data frame first. I’m saving my results for later:

```
d %>% count(obesity, class) -> tab
tab
```

or if you prefer to make it look more like a table of frequencies:

`tab %>% pivot_wider(names_from=class, values_from=n, values_fill = list(n=0))`

The thing on the end fills in zero frequencies as such (they would otherwise be `NA`, which they are not: we know they are zero). My immediate reaction to this is “it’s terrible!” But at least some of the men have their obesity group correctly predicted: 7 of the \(7+3+2+0=12\) men that are actually in group `a` are predicted to be in `a`; 9 of the 14 actual `b`’s are predicted to be `b`’s; 5 of the 8 actual `d`’s are predicted to be `d`’s. These are not so awful. But only 1 of the 11 `c`’s is correctly predicted to be a `c`!

As for what I want to see: I am looking for some kind of statement about how good you think the predictions are (the word “terrible” is fine for this) with some kind of support for your statement. For example, “the predictions are not that good, but at least group B is predicted with some accuracy (9 out of 14).”

I think looking at how well the individual groups were predicted is the most incisive way of getting at this, because the `c` men are the hardest to get right and the others are easier, but you could also think about an overall misclassification rate. This comes most easily from the “tidy” table:

`tab %>% count(correct = (obesity == class), wt = n)`

You can count anything, not just columns that already exist. This one is a kind of combined mutate-and-count to create the (logical) column called `correct`.

It’s a shortcut for this:

```
tab %>%
  mutate(is_correct = (obesity == class)) %>%
  count(is_correct, wt = n)
```

If I don’t put in the `wt`, `count` counts the number of *rows* for which the true and predicted obesity groups are the same. But that’s not what I want here: I want the number of *observations* totalled up, which is what the `wt =` does. It says “use the things in the given column as weights”, which means to total them up rather than count up the number of rows.

This says that 22 men were classified correctly and 23 were gotten wrong. We can find the proportions correct and wrong:

```
tab %>%
  count(correct = (obesity == class), wt = n) %>%
  mutate(proportion = n / sum(n))
```

and we see that 51% of men had their obesity group predicted wrongly. This is the overall misclassification rate, which is a simple summary of how good a job the discriminant analysis did.
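The same overall rate can be cross-checked from the confusion matrix directly, in base R only. Here the frequencies are typed in by hand from the table above, just to show the arithmetic:

```r
# Confusion matrix from earlier: rows = true group, columns = predicted group
conf <- matrix(c(7, 3, 2, 0,
                 2, 9, 2, 1,
                 3, 4, 1, 3,
                 2, 0, 1, 5),
               nrow = 4, byrow = TRUE,
               dimnames = list(obs = letters[1:4], pred = letters[1:4]))
sum(diag(conf))              # 22 men classified correctly
sum(conf) - sum(diag(conf))  # 23 classified wrongly
sum(diag(conf)) / sum(conf)  # proportion correct, about 0.49
```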

There is a subtlety here: `n` has changed its meaning in the middle of this calculation! In `tab`, `n` counts the number of each observed-and-predicted obesity combination, but now it counts the number of men classified correctly and incorrectly. The `wt = n` uses the first `n`, but the `mutate` line uses the *new* `n`, the result of the `count` line here. (I think `count` used to use `nn` for the result of the second `count`, so that you could tell them apart, but it no longer seems to do so.)
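To see the effect of `wt =` in isolation, here is a tiny made-up frequency table (three rows standing in for three observed/predicted combinations; not the real data):

```r
library(dplyr)
mini <- tibble(
  obesity = c("a", "a", "b"),
  class   = c("a", "b", "b"),
  n       = c(7, 3, 9)
)
# Without wt: counts rows (how many combinations are correct/wrong)
mini %>% count(correct = (obesity == class))
# With wt = n: totals up the frequencies instead (7 + 9 = 16 correct)
mini %>% count(correct = (obesity == class), wt = n)
```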

I said above that the obesity groups were not equally easy to predict. A small modification of the above will get the misclassification rates by (true) obesity group. This is done by putting an appropriate `group_by` in at the front, before we do any summarizing:

```
tab %>%
  group_by(obesity) %>%
  count(correct = (obesity == class), wt = n) %>%
  mutate(proportion = n / sum(n))
```

This gives the proportion wrong and correct for each (true) obesity group. I’m going to do one more cosmetic thing to make it easier to read, a kind of “untidying”:

```
tab %>%
  group_by(obesity) %>%
  count(correct = (obesity == class), wt = n) %>%
  mutate(proportion = n / sum(n)) %>%
  select(-n) %>%
  pivot_wider(names_from=correct, values_from=proportion)
```

Looking down the `TRUE` column, groups A, B and D were gotten about 60% correct (and 40% wrong), but group C is much worse. The overall misclassification rate is made bigger by the fact that C is so hard to predict.

Find out for yourself what happens if I fail to remove the `n` column before doing the `pivot_wider`.

A slightly more elegant look is obtained this way, by making nicer values than TRUE and FALSE:

```
tab %>%
  group_by(obesity) %>%
  mutate(prediction_stat = ifelse(obesity == class, "correct", "wrong")) %>%
  count(prediction_stat, wt = n) %>%
  mutate(proportion = n / sum(n)) %>%
  select(-n) %>%
  pivot_wider(names_from=prediction_stat, values_from=proportion)
```

\(\blacksquare\)

Solution

On the plot of (here), we said that there was a lot of scatter, but that groups `a`, `b` and `d` tended to be found at the top, right and left respectively of the plot. That suggests that these three groups should be somewhat predictable. The `c`’s, on the other hand, were all over the place on the plot, and were mostly predicted wrong.

The idea is that the stories you pull from the plot and the predictions should be more or less consistent. There are several ways you might say that: another approach is to say that the observations are all over the place on the plot, and the predictions are all bad. This is not as insightful as my comments above, but if that’s what the plot told you, that’s what the predictions would seem to be saying as well. (Or even, the predictions are not so bad compared to the apparently random pattern on the plot, if that’s what you saw. There are different ways to say something more or less sensible.)

\(\blacksquare\)

## 35.10 Understanding a MANOVA

One use of discriminant analysis is to understand the results of a MANOVA. This question is a followup to a previous MANOVA that we did, the one with two variables `y1` and `y2` and three groups `a` through `c`. The data were in link.

- Read the data in again and run the MANOVA that you did before.

Solution

This is an exact repeat of what you did before:

```
my_url <- "http://ritsokiguess.site/datafiles/simple-manova.txt"
# my_url <- "https://raw.githubusercontent.com/nxskok/datafiles/master/simple-manova.txt"
simple <- read_delim(my_url, " ")
```

```
Rows: 12 Columns: 3
── Column specification ────────────────────────────────────────────────────────
Delimiter: " "
chr (1): group
dbl (2): y1, y2
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
```

`simple`

```
response <- with(simple, cbind(y1, y2))
simple.3 <- manova(response ~ group, data = simple)
summary(simple.3)
```

```
Df Pillai approx F num Df den Df Pr(>F)
group 2 1.3534 9.4196 4 18 0.0002735 ***
Residuals 9
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```

This P-value is small, so there is some way in which some of the groups differ on some of the variables.^{4}

We should check that we believe this, using Box’s M test:^{5}

`summary(BoxM(response, simple$group))`

```
Box's M Test
Chi-Squared Value = 3.517357 , df = 6 and p-value: 0.742
```

There is no problem here: no evidence that any of the response variables differ in spread across the groups.

\(\blacksquare\)

- Run a discriminant analysis “predicting” group from the two response variables. Display the output.

Solution

This:

```
simple.4 <- lda(group ~ y1 + y2, data = simple)
simple.4
```

```
Call:
lda(group ~ y1 + y2, data = simple)
Prior probabilities of groups:
a b c
0.3333333 0.2500000 0.4166667
Group means:
y1 y2
a 3.000000 4.0
b 4.666667 7.0
c 8.200000 6.4
Coefficients of linear discriminants:
LD1 LD2
y1 0.7193766 0.4060972
y2 0.3611104 -0.9319337
Proportion of trace:
LD1 LD2
0.8331 0.1669
```

Note that this is the other way around from MANOVA: here, we are “predicting the group” from the response variables, in the same manner as one of the flavours of logistic regression: “what makes the groups different, in terms of those response variables?”.

\(\blacksquare\)

- * In the output from the discriminant analysis, why are there exactly two linear discriminants `LD1` and `LD2`?

Solution

There are two linear discriminants because there are 3 groups and 2 variables, so there are the smaller of \(3-1=2\) and 2, that is, 2 discriminants.

\(\blacksquare\)

- * From the output, how would you say that the first linear discriminant `LD1` compares in importance to the second one `LD2`: much more important, more important, equally important, less important, much less important? Explain briefly.

Solution

Look at the `Proportion of trace` at the bottom of the output. The first number is much bigger than the second, so the first linear discriminant is much more important than the second. (I care about your reason; you can say it’s “more important” rather than “much more important” and I’m good with that.)

\(\blacksquare\)

- Obtain a plot of the discriminant scores.

Solution

This was the old-fashioned way:

`plot(simple.4)`

It needs cajoling to produce colours, but we can do better. The first thing is to obtain the predictions:

`simple.pred <- predict(simple.4)`

Then we make a data frame out of the discriminant scores and the true groups, using `cbind`:

```
d <- cbind(simple, simple.pred)
head(d)
```

or like this, for fun:^{6}

```
ld <- as_tibble(simple.pred$x)
post <- as_tibble(simple.pred$posterior)
dd <- bind_cols(simple, class = simple.pred$class, ld, post)
dd
```

After that, we plot the first one against the second one, colouring by true groups:

`ggplot(d, aes(x = x.LD1, y = x.LD2, colour = group)) + geom_point()`

I wanted to compare this plot with the original plot of `y1` vs. `y2`, coloured by groups:

`ggplot(simple, aes(x = y1, y = y2, colour = group)) + geom_point()`

The difference between this plot and the one of `LD1` vs. `LD2` is that things have been rotated a bit so that most of the separation of groups is done by `LD1`. This is reflected in the fact that `LD1` is quite a bit more important than `LD2`: the latter doesn’t help much in separating the groups.

With that in mind, we could also plot just `LD1`, presumably against groups via boxplot:

`ggplot(d, aes(x = group, y = x.LD1)) + geom_boxplot()`

This shows that `LD1` does a pretty fine job of separating the groups, and `LD2` doesn’t really have much to add to the picture.

\(\blacksquare\)

- Describe briefly how `LD1` and/or `LD2` separate the groups. Does your picture confirm the relative importance of `LD1` and `LD2` that you found back in part (here)? Explain briefly.

Solution

`LD1` separates the groups left to right: group `a` is low on `LD1`, `b` is in the middle and `c` is high on `LD1`. (There is no intermingling of the groups on `LD1`, so it separates the groups perfectly.)

As for `LD2`, all it does (possibly) is to distinguish `b` (low) from `a` and `c` (high). Or you can, just as reasonably, take the view that it doesn’t really separate any of the groups.

Back in part (here), you said (I hope) that `LD1` was (very) important compared to `LD2`. This shows up here in that `LD1` does a very good job of distinguishing the groups, while `LD2` does a poor to non-existent job of separating any groups. (If you didn’t say that before, here is an invitation to reconsider what you *did* say there.)

\(\blacksquare\)

- What makes group `a` have a low score on `LD1`? There are two steps that you need to make: consider the means of group `a` on variables `y1` and `y2` and how they compare to the other groups, and consider how `y1` and `y2` play into the score on `LD1`.

Solution

The information you need is in the big output.

The means of `y1` and `y2` for group `a` are 3 and 4 respectively, which are the lowest of all the groups. That’s the first thing.

The second thing is the coefficients of `LD1` in terms of `y1` and `y2`, which are both *positive*. That means, for any observation, if its `y1` and `y2` values are *large*, that observation’s score on `LD1` will be large as well. Conversely, if its values are *small*, as the ones in group `a` are, its score on `LD1` will be small.

You need these two things.

This explains why the group `a` observations are on the left of the plot. It also explains why the group `c` observations are on the right: they are *large* on both `y1` and `y2`, and so large on `LD1`.

What about `LD2`? This is a little more confusing (and thus I didn’t ask you about that). Its “coefficients of linear discriminants” are positive on `y1` and negative on `y2`, with the latter being bigger in size. Group `b` is about average on `y1` and distinctly *high* on `y2`; the second of these, coupled with the negative coefficient on `y2`, means that the `LD2` score for observations in group `b` will be *negative*.

For `LD2`, group `a` has a low mean on both variables and group `c` has a high mean, so for both groups there is a kind of cancelling-out happening, and neither group `a` nor group `c` will be especially remarkable on `LD2`.

\(\blacksquare\)

- Obtain predictions for the group memberships of each observation, and make a table of the actual group memberships against the predicted ones. How many of the observations were wrongly classified?

Solution

Use the `simple.pred` that you got earlier. This is the `table` way:

`with(d, table(obs = group, pred = class))`

```
pred
obs a b c
a 4 0 0
b 0 3 0
c 0 0 5
```

Every single one of the 12 observations has been classified into its correct group. (There is nothing off the diagonal of this table.) The alternative to `table` is the `tidyverse` way:

`d %>% count(group, class)`

or

```
d %>%
  count(group, class) %>%
  pivot_wider(names_from=class, values_from=n, values_fill = list(n=0))
```

if you want something that looks like a frequency table. All the `a`s got classified as `a`, and so on. That’s the end of what I asked you to do, but as ever I wanted to press on. The next question to ask after getting the predicted groups is “what are the posterior probabilities of being in each group for each observation”: that is, not just which group do I think it belongs in, but how sure am I about that call? The posterior probabilities in my `d` start with `posterior`. These have a ton of decimal places, which I like to round off before I display them, e.g. to 3 decimals here:

```
d %>%
  select(y1, y2, group, class, starts_with("posterior")) %>%
  mutate(across(starts_with("posterior"), \(post) round(post, 3)))
```
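The `across` with an anonymous function is worth seeing on its own. A minimal made-up example (one row, two invented posterior columns):

```r
library(dplyr)
probs <- tibble(group = "a",
                posterior.a = 0.8137542,
                posterior.b = 0.1234567)
# across() applies the same function to every column whose name starts
# with "posterior"; \(post) ... is shorthand for function(post) ...
probs %>% mutate(across(starts_with("posterior"), \(post) round(post, 3)))
```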

You see that the posterior probability of an observation being in the group it actually *was* in is close to 1 all the way down. The only one with any doubt at all is observation #6, which is actually in group `b`, but has “only” probability 0.814 of being a `b` based on its `y1` and `y2` values. What else could it be? Well, it’s about equally split between being `a` and `c`. Let me see if I can display this observation on the plot in a different way. First I need to make a new column picking out observation 6, and then I use this new variable as the `shape` of the point I plot:

```
simple %>%
  mutate(is6 = (row_number() == 6)) %>%
  ggplot(aes(x = y1, y = y2, colour = group, shape = is6)) +
  geom_point(size = 3)
```

That makes it stand out a bit: if you look carefully, one of the green points (observation 6) is plotted as a triangle rather than a circle, as the legend for `is6` indicates. (I plotted all the points bigger to make this easier to see.)

Since observation #6 is in group `b`, it appears as a green triangle. What makes it least like a `b`? Well, it has the smallest `y2` value of any of the `b`’s (which makes it most like an `a` of any of the `b`’s), and it has the largest `y1` value (which makes it most like a `c` of any of the `b`’s). But still, it’s nearer the greens than anything else, so it’s still more like a `b` than it is like any of the other groups.

\(\blacksquare\)

## 35.11 What distinguishes people who do different jobs?

244^{7} people work at a certain company. They each have one of three jobs: customer service, mechanic, dispatcher. In the data set, these are labelled 1, 2 and 3 respectively. In addition, they are each rated on scales called `outdoor`, `social` and `conservative`. Do people with different jobs tend to have different scores on these scales, or, to put it another way, if you knew a person’s scores on `outdoor`, `social` and `conservative`, could you say something about what kind of job they were likely to hold? The data are in link.

- Read in the data and display some of it.

Solution

The usual. This one is aligned columns. I’m using a “temporary” name for my read-in data frame, since I’m going to create the proper one in a moment.

```
my_url <- "https://raw.githubusercontent.com/nxskok/datafiles/master/jobs.txt"
jobs0 <- read_table(my_url)
```

`Warning: Missing column names filled in: 'X6' [6]`

```
── Column specification ────────────────────────────────────────────────────────
cols(
outdoor = col_double(),
social = col_double(),
conservative = col_double(),
job = col_double(),
id = col_double(),
X6 = col_character()
)
```

```
Warning: 244 parsing failures.
row col expected actual file
1 -- 6 columns 5 columns 'https://raw.githubusercontent.com/nxskok/datafiles/master/jobs.txt'
2 -- 6 columns 5 columns 'https://raw.githubusercontent.com/nxskok/datafiles/master/jobs.txt'
3 -- 6 columns 5 columns 'https://raw.githubusercontent.com/nxskok/datafiles/master/jobs.txt'
4 -- 6 columns 5 columns 'https://raw.githubusercontent.com/nxskok/datafiles/master/jobs.txt'
5 -- 6 columns 5 columns 'https://raw.githubusercontent.com/nxskok/datafiles/master/jobs.txt'
... ... ......... ......... ....................................................................
See problems(...) for more details.
```

`jobs0`

We got all that was promised, plus a label `id` for each employee, which we will from here on ignore.^{8}

\(\blacksquare\)

- Note the types of each of the variables, and create any new variables that you need to.

Solution

These are all `int`, or whole numbers. But the job ought to be a `factor`: the labels 1, 2 and 3 have no meaning as such; they just label the three different jobs. (I gave you a hint of this above.) So we need to turn `job` into a factor. I think the best way to do that is via `mutate`, and then we save the new data frame into one called `jobs` that we actually use for the analysis below:

```
job_labels <- c("custserv", "mechanic", "dispatcher")
jobs0 %>%
  mutate(job = factor(job, labels = job_labels)) -> jobs
```

I lived on the edge and saved my factor `job` into a variable with the same name as the numeric one. I should check that I now have the right thing:

`jobs`

I like this better because you see the actual factor levels rather than the underlying numeric values by which they are stored.

All is good here. If you forget the `labels` thing, you’ll get a factor, but its levels will be 1, 2, and 3, and you will have to remember which jobs they go with. I’m a fan of giving factors named levels, so that you can remember what stands for what.^{9}

Extra: another way of doing this is to make a lookup table, that is, a little table that shows which job goes with which number:

```
lookup_tab <- tribble(
  ~job, ~jobname,
  1, "custserv",
  2, "mechanic",
  3, "dispatcher"
)
lookup_tab
```

I carefully put the numbers in a column called `job` because I want to match these with the column called `job` in `jobs0`:

```
jobs0 %>%
  left_join(lookup_tab) %>%
  sample_n(20)
```

`Joining with `by = join_by(job)``

You see that each row has the *name* of the job that employee has, in the column `jobname`, because the job `id` was looked up in our lookup table. (I displayed some random rows so you could see that it worked.)

\(\blacksquare\)

- Run a multivariate analysis of variance to convince yourself that there are some differences in scale scores among the jobs.

Solution

You know how to do this, right? This one is the easy way:

```
response <- with(jobs, cbind(social, outdoor, conservative))
response.1 <- manova(response ~ job, data = jobs)
summary(response.1)
```

```
Df Pillai approx F num Df den Df Pr(>F)
job 2 0.76207 49.248 6 480 < 2.2e-16 ***
Residuals 241
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```

Or you can use `Manova`. That is mostly for practice here, since there is no reason to make things difficult for yourself:

```
library(car)
response.2 <- lm(response ~ job, data = jobs)
summary(Manova(response.2))
```

```
Type II MANOVA Tests:
Sum of squares and products for error:
social outdoor conservative
social 4406.2994 334.90273 235.46301
outdoor 334.9027 4082.46302 64.76334
conservative 235.4630 64.76334 2683.25695
------------------------------------------
Term: job
Sum of squares and products for the hypothesis:
social outdoor conservative
social 2889.1228 -794.3945 -1405.8401
outdoor -794.3945 1609.7993 283.1711
conservative -1405.8401 283.1711 691.7594
Multivariate Tests: job
Df test stat approx F num Df den Df Pr(>F)
Pillai 2 0.7620657 49.24757 6 480 < 2.22e-16 ***
Wilks 2 0.3639880 52.38173 6 478 < 2.22e-16 ***
Hotelling-Lawley 2 1.4010307 55.57422 6 476 < 2.22e-16 ***
Roy 2 1.0805270 86.44216 3 240 < 2.22e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```

This version gives the four different versions of the test (rather than just the Pillai test that `manova` gives), but the results are in this case identical for all of them.

So: oh yes, there are differences (on some or all of the variables, for some or all of the groups). So we need something like discriminant analysis to understand the differences.

We really ought to follow this up with Box’s M test, to be sure that the variances and correlations for each variable are equal enough across the groups, but we note off the top that the P-values (all of them) are really small, so there ought not to be much doubt about the conclusion anyway:

`summary(BoxM(response, jobs$job))`

```
Box's M Test
Chi-Squared Value = 25.64176 , df = 12 and p-value: 0.0121
```

This is small, but not (for this test) small enough to worry about (it’s not less than 0.001).

This, and the `lda` below, actually works perfectly well if you use the original (integer) job, but then you have to remember which job number is which.

\(\blacksquare\)

- Run a discriminant analysis and display the output.

Solution

Now `job` is the “response”:

```
job.1 <- lda(job ~ social + outdoor + conservative, data = jobs)
job.1
```

```
Call:
lda(job ~ social + outdoor + conservative, data = jobs)
Prior probabilities of groups:
custserv mechanic dispatcher
0.3483607 0.3811475 0.2704918
Group means:
social outdoor conservative
custserv 24.22353 12.51765 9.023529
mechanic 21.13978 18.53763 10.139785
dispatcher 15.45455 15.57576 13.242424
Coefficients of linear discriminants:
LD1 LD2
social -0.19427415 -0.04978105
outdoor 0.09198065 -0.22501431
conservative 0.15499199 0.08734288
Proportion of trace:
LD1 LD2
0.7712 0.2288
```

\(\blacksquare\)

- Which is the more important, `LD1` or `LD2`? How much more important? Justify your answer briefly.

Solution

Look at the “proportion of trace” at the bottom. The value for `LD1` is quite a bit higher, so `LD1` is quite a bit more important when it comes to separating the groups. `LD2` is, as I said, less important, but is not completely worthless, so it will be worth taking a look at it.

\(\blacksquare\)

- Describe what values for an individual on the scales will make each of `LD1` and `LD2` high.

Solution

This is a two-parter: decide whether each scale makes a positive, negative or zero contribution to the linear discriminant (looking at the “coefficients of linear discriminants”), and then translate that into what would make each `LD` high. Let’s start with `LD1`:

Its coefficients on the three scales are respectively negative (\(-0.19\)), zero (0.09; my call) and positive (0.15). Where you draw the line is up to you: if you want to say that `outdoor`’s contribution is positive, go ahead. This means that `LD1` will be high if `social` is *low* and if `conservative` is *high*. (If you thought that `outdoor`’s coefficient was positive rather than zero, then `LD1` is also high if `outdoor` is high.)

Now for `LD2`: I’m going to call `outdoor`’s coefficient of \(-0.22\) negative and the other two zero, so that `LD2` is high if `outdoor` is *low*. Again, if you made a different judgement call, adapt your answer accordingly.

\(\blacksquare\)

- The first group of employees, customer service, have the highest mean on `social` and the lowest mean on both of the other two scales. Would you expect the customer service employees to score high or low on `LD1`? What about `LD2`?

Solution

In the light of what we said in the previous part, the customer service employees, who are high on `social` and low on `conservative`, should be *low* (negative) on `LD1`, since both of these means are pointing that way. As I called it, the only thing that matters to `LD2` is `outdoor`, which is *low* for the customer service employees, and thus `LD2` for them will be *high* (negative coefficient).

\(\blacksquare\)

- Plot your discriminant scores (which you will have to obtain first), and see if you were right about the customer service employees in terms of `LD1` and `LD2`. The job names are rather long, and there are a lot of individuals, so it is probably best to plot the scores as coloured circles with a legend saying which colour goes with which job (rather than labelling each individual with the job they have).

Solution

Predictions first, then make a data frame combining the predictions with the original data:

```
p <- predict(job.1)
d <- cbind(jobs, p)
d
```

Following my suggestion, plot these the standard way with `colour` distinguishing the jobs:

`ggplot(d, aes(x = x.LD1, y = x.LD2, colour = job)) + geom_point()`

`ggplot(d, aes(x = x.LD1, y = x.LD2, colour = class)) + geom_point()`

`# ggplot(d, aes(x = job, y = x.LD1)) + geom_boxplot()`

I was mostly right about the customer service people: small `LD1` definitely, large `LD2` kinda. I wasn’t more right because the group means don’t tell the whole story: evidently, the customer service people vary quite a bit on `outdoor`, so the red dots are all over the left side of the plot.

There is quite a bit of intermingling of the three employee groups on the plot, but the point of the MANOVA is that the groups are (way) more separated than you’d expect by chance, that is if the employees were just randomly scattered across the plot.

To think back to that `trace` thing: here, it seems that `LD1` mainly separates customer service (left) from dispatchers (right); the mechanics are all over the place on `LD1`, but they tend to be low on `LD2`. So `LD2` *does* have something to say.

\(\blacksquare\)

- * Obtain predicted job allocations for each individual (based on their scores on the three scales), and tabulate the true jobs against the predicted jobs. How would you describe the quality of the classification? Is that in line with what the plot would suggest?

Solution

Use the predictions that you got before and saved in `d`

:

`with(d, table(obs = job, pred = class))`

```
pred
obs custserv mechanic dispatcher
custserv 68 13 4
mechanic 16 67 10
dispatcher 3 13 50
```

Or, the `tidyverse`

way:

`d %>% count(job, class)`

or:

```
d %>%
  count(job, class) %>%
  pivot_wider(names_from = class, values_from = n, values_fill = list(n = 0))
```

I didn’t really need the `values_fill`

since there are no missing frequencies, but I’ve gotten used to putting it in. There are a lot of misclassifications, but there are a lot of people, so a large fraction of people actually got classified correctly. The biggest frequencies are of people who got classified correctly. I think this is about what I was expecting, looking at the plot: the people top left are obviously customer service, the ones top right are in dispatch, and most of the ones at the bottom are mechanics. So there will be some errors, but the majority of people should be gotten right. The easiest pairing to get confused is customer service and mechanics, which you might guess from the plot: those customer service people with a middling `LD1`

score and a low `LD2`

score (that is, high on `outdoor`

) could easily be confused with the mechanics. The easiest pairing to distinguish is customer service and dispatchers: on the plot, left and right, that is, low and high respectively on `LD1`

.

What fraction of people actually got misclassified? You could just pull out the numbers and add them up, but you know me: I’m too lazy to do that.

We can work out the total number and fraction who got misclassified. There are different ways you might do this, but the `tidyverse`

way provides the easiest starting point. For example, we can make a new column that indicates whether a group is the correct or wrong classification:

```
d %>%
  count(job, class) %>%
  mutate(job_stat = ifelse(job == class, "correct", "wrong"))
```

From there, we count up the correct and wrong ones, recognizing that we want to total up the *frequencies* in `n`

, not just count the number of rows:

```
d %>%
  count(job, class) %>%
  mutate(job_stat = ifelse(job == class, "correct", "wrong")) %>%
  count(job_stat, wt = n)
```

and turn these into proportions:

```
d %>%
  count(job, class) %>%
  mutate(job_stat = ifelse(job == class, "correct", "wrong")) %>%
  count(job_stat, wt = n) %>%
  mutate(proportion = n / sum(n))
```

There is a `count`

followed by another `count`

of the first lot of counts, so the second count column has taken over the name `n`

.

24% of all the employees got classified into the wrong job, based on their scores on `outdoor`

, `social`

and `conservative`

.

This is actually not bad, from one point of view: if you just guessed which job each person did, without looking at their scores on the scales at all, you would get \({1\over 3}=33\%\) of them right, just by luck, and \({2\over3}=67\%\) of them wrong. From 67% to 24% error is a big improvement, and *that* is what the MANOVA is reacting to.
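As a quick arithmetic check, the 24% can be recovered directly from the confusion matrix shown earlier (the numbers below are copied from that table):

```r
# Error rate recomputed from the confusion matrix shown earlier
# (rows = true job, columns = predicted job)
conf <- matrix(c(68, 13, 4,
                 16, 67, 10,
                  3, 13, 50),
               nrow = 3, byrow = TRUE)
wrong <- sum(conf) - sum(diag(conf))
wrong / sum(conf)  # misclassification rate, about 0.24
```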

To figure out whether some of the groups were harder to classify than others, squeeze a `group_by`

in early to do the counts and proportions for each (true) job:

```
d %>%
  count(job, class) %>%
  mutate(job_stat = ifelse(job == class, "correct", "wrong")) %>%
  group_by(job) %>%
  count(job_stat, wt = n) %>%
  mutate(proportion = n / sum(n))
```

or even split out the correct and wrong ones into their own columns:

```
d %>%
  count(job, class) %>%
  mutate(job_stat = ifelse(job == class, "correct", "wrong")) %>%
  group_by(job) %>%
  count(job_stat, wt = n) %>%
  mutate(proportion = n / sum(n)) %>%
  select(-n) %>%
  pivot_wider(names_from = job_stat, values_from = proportion)
```

The mechanics were hardest to get right and easiest to get wrong, though there isn’t much in it. I think the reason is that the mechanics were sort of “in the middle”, in that a mechanic could be mistaken for either a dispatcher or a customer service representative, but customer service and dispatchers were more or less distinct from each other.

It’s up to you whether you prefer to do this kind of thing by learning enough about `table`

to get it to work, or whether you want to use tidy-data mechanisms to do it in a larger number of smaller steps. I immediately thought of `table`

because I knew about it, but the tidy-data way is more consistent with the way we have been doing things.

\(\blacksquare\)

- Consider an employee with these scores: 20 on
`outdoor`

, 17 on `social`

and 8 on `conservative`.

What job do you think they do, and how certain are you about that? Use `predict`

, first making a data frame out of the values to predict for.

Solution

This is in fact exactly the same idea as the data frame that I generally called `new`

when doing predictions for other models. I think the clearest way to make one of these is with `tribble`

:

```
new <- tribble(
  ~outdoor, ~social, ~conservative,
  20, 17, 8
)
new
```


There’s no need for `datagrid`

or `crossing`

here because I’m not doing combinations of things. (I might have done that here, to get a sense for example of “what effect does a higher score on `outdoor`

have on the likelihood of a person doing each job?”. But I didn’t.)

Then feed this into `predict`

as the *second* thing:

`pp1 <- predict(job.1, new)`

Our predictions are these:

`cbind(new, pp1)`

The `class`

thing gives our predicted job, and the `posterior`

probabilities say how sure we are about that. So we reckon there’s a 78% chance that this person is a mechanic; they might be a dispatcher but they are unlikely to be in customer service. Our best guess is that they are a mechanic.^{10}
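In case it helps to see where those posterior probabilities come from: `lda` applies Bayes’ rule, combining the prior probabilities of the groups with how likely the observed scores are under each group. Here is a minimal one-variable sketch with made-up numbers (these priors, means, and SD are hypothetical, not from the `jobs` data):

```r
# Bayes' rule sketch: posterior(group | x) is proportional to
# prior(group) * density(x | group); lda assumes a common (pooled) spread
priors    <- c(mechanic = 0.5, dispatcher = 0.5)  # hypothetical priors
means     <- c(mechanic = 10, dispatcher = 14)    # hypothetical group means
sd_pooled <- 2
x <- 11                                           # observed score
dens <- dnorm(x, mean = means, sd = sd_pooled)
post <- priors * dens / sum(priors * dens)
post  # probabilities summing to 1; mechanic comes out more likely here
```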

Does this pass the sanity-check test? First figure out where our new employee stands compared to the others:

`summary(jobs)`

```
outdoor social conservative job
Min. : 0.00 Min. : 7.00 Min. : 0.00 custserv :85
1st Qu.:13.00 1st Qu.:17.00 1st Qu.: 8.00 mechanic :93
Median :16.00 Median :21.00 Median :11.00 dispatcher:66
Mean :15.64 Mean :20.68 Mean :10.59
3rd Qu.:19.00 3rd Qu.:25.00 3rd Qu.:13.00
Max. :28.00 Max. :35.00 Max. :20.00
id X6
Min. : 1.00 Length:244
1st Qu.:21.00 Class :character
Median :41.00 Mode :character
Mean :41.95
3rd Qu.:61.25
Max. :93.00
```

Their score on `outdoor`

is above average, but their scores on the other two scales are below average (right on the 1st quartile in each case).

Go back to the table of means from the discriminant analysis output. The mechanics have the highest average for `outdoor`

, they’re in the middle on `social`

and they are lowish on `conservative`

. Our new employee is at least somewhat like that.

Or, we can figure out where our new employee sits on the plot. The output from `predict`

gives the predicted `LD1`

and `LD2`

, which are 0.71 and \(-1.02\) respectively. This employee would sit to the right of and below the middle of the plot: in the greens, but with a few blues nearby: most likely a mechanic, possibly a dispatcher, but likely not customer service, as the posterior probabilities suggest.

Extra: I can use the same mechanism to predict for a combination of values. This would allow for the variability of each of the original variables to differ, and enable us to assess the effect of, say, a change in `conservative`

over its “typical range”, which we found out above with `summary(jobs)`

. I’ll take the quartiles, in my usual fashion:

```
outdoors <- c(13, 19)
socials <- c(17, 25)
conservatives <- c(8, 13)
```

The IQRs are not that different, which says that what we get here will not be that different from the “coefficients of linear discriminants” above:

```
new <- crossing(
  outdoor = outdoors, social = socials,
  conservative = conservatives
)
pp2 <- predict(job.1, new)
px <- round(pp2$x, 2)
cbind(new, pp2$class, px)
```

The highest (most positive) LD1 score goes with high outdoor, low social, high conservative (and being a dispatcher). It is often interesting to look at the *second*-highest one as well: here that is *low* outdoor, and the same low social and high conservative as before. That means that `outdoor`

has nothing much to do with `LD1`

score. Being low `social`

is strongly associated with `LD1`

being positive, so that’s the important part of `LD1`

.

What about `LD2`

? The most positive LD2 are these:

```
LD2 outdoor social conservative
====================================
0.99 low low high
0.59 low high high
0.55 low low low
```

These most consistently go with `outdoor`

being low.

Is that consistent with the “coefficients of linear discriminants”?

`job.1$scaling`

```
LD1 LD2
social -0.19427415 -0.04978105
outdoor 0.09198065 -0.22501431
conservative 0.15499199 0.08734288
```

Very much so: `outdoor`

has nothing much to do with `LD1`

and everything to do with `LD2`

.

\(\blacksquare\)

- Since I am not making you hand this one in, I’m going to keep going. Re-run the analysis to incorporate cross-validation, and make a table of the predicted group memberships. Is it much different from the previous one you had? Why would that be?

Solution

Stick a `CV=T`

in the `lda`

:

```
job.3 <- lda(job ~ social + outdoor + conservative, data = jobs, CV = T)
glimpse(job.3)
```

```
List of 5
$ class : Factor w/ 3 levels "custserv","mechanic",..: 1 2 1 1 1 2 1 1 1 1 ...
$ posterior: num [1:244, 1:3] 0.902 0.352 0.71 0.805 0.766 ...
..- attr(*, "dimnames")=List of 2
.. ..$ : chr [1:244] "1" "2" "3" "4" ...
.. ..$ : chr [1:3] "custserv" "mechanic" "dispatcher"
$ terms :Classes 'terms', 'formula' language job ~ social + outdoor + conservative
.. ..- attr(*, "variables")= language list(job, social, outdoor, conservative)
.. ..- attr(*, "factors")= int [1:4, 1:3] 0 1 0 0 0 0 1 0 0 0 ...
.. .. ..- attr(*, "dimnames")=List of 2
.. ..- attr(*, "term.labels")= chr [1:3] "social" "outdoor" "conservative"
.. ..- attr(*, "order")= int [1:3] 1 1 1
.. ..- attr(*, "intercept")= int 1
.. ..- attr(*, "response")= int 1
.. ..- attr(*, ".Environment")=<environment: R_GlobalEnv>
.. ..- attr(*, "predvars")= language list(job, social, outdoor, conservative)
.. ..- attr(*, "dataClasses")= Named chr [1:4] "factor" "numeric" "numeric" "numeric"
.. .. ..- attr(*, "names")= chr [1:4] "job" "social" "outdoor" "conservative"
$ call : language lda(formula = job ~ social + outdoor + conservative, data = jobs, CV = T)
$ xlevels : Named list()
```

This directly contains a `class`

(no need for a `predict`

), so we make a data frame, with a different name since I shortly want to compare this one with the previous one:

```
dcv <- cbind(jobs, class = job.3$class, posterior = job.3$posterior)
head(dcv)
```

This is a bit fiddlier than before because `job.3`

contains some things of different lengths and we can’t just `cbind`

them all together.

Then go straight to the `table`

:

`with(dcv, table(job, class))`

```
class
job custserv mechanic dispatcher
custserv 67 14 4
mechanic 16 67 10
dispatcher 3 14 49
```

This is almost exactly the same as we had in part (here): the cross-validation has made almost no difference. The reason for that is that here, we have lots of data (you can predict for one mechanic, say, and there are still lots of others to say that the mechanics are “over there”). This is in sharp contrast to the example in class with the bellydancers, where if you try to predict for one of the extreme ones, the notion of “where are the bellydancers” changes substantially. Here, I suspect that the few people whose predictions changed were ones where the posterior probabilities were almost equal for two jobs, and the cross-validation was just enough to tip the balance. You can check this, but there are a lot of posterior probabilities to look at!

This is another way of saying that with small data sets, your conclusions are more “fragile” or less likely to be generalizable. With a larger data set like this one, cross-validation, which is the right thing to do, makes almost no difference.^{11}
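If you want to see exactly what `CV = T` is doing, here is leave-one-out cross-validation written out by hand. I use the built-in `iris` data so that the example is self-contained; the idea carries over to `jobs` directly:

```r
library(MASS)

# Leave-one-out by hand: classify each row using a model
# fitted to all the *other* rows
loo <- sapply(seq_len(nrow(iris)), function(i) {
  fit <- lda(Species ~ ., data = iris[-i, ])
  as.character(predict(fit, iris[i, ])$class)
})
mean(loo == iris$Species)  # proportion classified correctly

# lda's built-in cross-validation does the same thing in one step
cv <- lda(Species ~ ., data = iris, CV = TRUE)
mean(cv$class == iris$Species)
```

The point of leaving each row out before predicting it is that the prediction is then honest: the row being classified had no say in where the group boundaries went.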

All right, I suppose I do want to investigate the individuals whose predicted jobs changed, and look at their posterior probabilities. I think I have the machinery to do that.

Let’s start by gluing together the dataframes with the predictions from the regular `lda`

(in `d`

) and the ones from the cross-validation (in `dcv`

). I think I can do that like this:

`d %>% left_join(dcv, join_by(id, job))`

There’s already subtlety. The people are numbered separately within each actual `job`

, so the thing that uniquely identifies each person (what database people call a “key”) is the combination of the actual job they do *plus* their `id`

within that job. You might also think of using `bind_cols`

, except that this adds a number to all the column names which is a pain to deal with. I don’t really need to look up the people in the second dataframe, since I know where they are (in the corresponding rows), but doing so seems to make everything else easier.

The columns with an `x`

on the end of their names came from `d`

, that is, the predictions without cross-validation, and the ones with a `y`

came from cross-validation. Let’s see if we can keep only the columns we need so that it’s a bit less unwieldy:

```
d %>% left_join(dcv, join_by(id, job)) %>%
  select(outdoor = outdoor.x,
         social = social.x,
         conservative = conservative.x,
         starts_with("class"),
         job, id, starts_with("posterior")) -> all
all
```

That’s not too bad. We could shorten some variable names and reduce some decimal places, but that’ll do for now.

How many individuals were predicted differently? We have columns called `class.x`

(predicted group membership from original LDA) and `class.y`

(from cross-validation), and so:

`all %>% filter(class.x != class.y)`

There are exactly *two* individuals that were predicted differently. Under cross-validation, they both got called mechanics. How do their posterior probabilities compare? These are all in columns beginning with `posterior`

. We could scrutinize the output above, or try to make things simpler. Let’s round them to three decimals, and then display only some of the columns:

```
all %>% filter(class.x != class.y) %>%
  mutate(across(starts_with("posterior"), \(post) round(post, 3))) %>%
  select(id, job, starts_with("posterior"))
```

And then, because I can, let’s re-format that to make it easier to read, `x`

being regular LDA and `y`

being cross-validation:

```
all %>% filter(class.x != class.y) %>%
  mutate(across(starts_with("posterior"), \(post) round(post, 3))) %>%
  select(id, job, starts_with("posterior")) %>%
  pivot_longer(starts_with("posterior"), names_to = c("post_job", "method"),
               names_pattern = "posterior\\.(.*)\\.(.)", values_to = "prob") %>%
  pivot_wider(names_from = method, values_from = prob)
```

As I suspected, the posterior probabilities in each case are almost identical, but different ones happen to be slightly higher in the two cases. For the first individual (actually in customer service), cross-validation just tweaked the posterior probabilities enough to call that individual a mechanic, and for the second one, actually a dispatcher, the first analysis was almost too close to call, and things under cross-validation got nudged onto the mechanic side again.

All right, what *about* those people who got misclassified (say, by the LDA rather than the cross-validation, since it seems not to make much difference)?

Let’s count them first:

```
all %>% mutate(is_correct = ifelse(job == class.x, "correct", "wrong")) -> all.mis
all.mis %>%
  count(is_correct == "wrong") %>%
  mutate(proportion = n / sum(n))
```

24% of them. There are a lot of them, so we’ll pick a random sample to look at, rounding the posterior probabilities to 3 decimals first and reducing the number of columns to look at:

```
set.seed(457299)
all.mis %>%
  filter(is_correct == "wrong") %>%
  mutate(across(starts_with("posterior"), \(post) round(post, 3))) %>%
  select(
    id, job, class.x, outdoor, social, conservative, starts_with("posterior")
  ) %>%
  sample_n(15)
```

I put the `set.seed`

in so that this will come out the same each time I do it, and so that the discussion below always makes sense.

Now we can look at the true and predicted jobs for these people, and the posterior probabilities (which I rounded earlier).

- The first one, id 6, is badly wrong; this was actually a mechanic, but the posterior probabilities say that it is a near-certain dispatcher.
- The second one, id 65, is a little better, but the posterior probability of actually being a mechanic is only 0.269; the probability of being a dispatcher is much higher at 0.593, so that’s what it gets classified as.
- The third one, though, id #61, is a very close call: posterior probability 0.438 of being in customer service (correct), 0.453 of being a dispatcher, only slightly higher, but enough to make the prediction wrong.

The implication from looking at our sample of 15 people is that some of them are “badly” misclassified (with a high posterior probability of having a different job from the one they actually hold), but a lot of them came out on the wrong end of a close call. This suggests that a number of the correct classifications came out *right* almost by chance as well, with (hypothesizing) two close posterior probabilities of which their actual job came out slightly higher.

Further further analysis would look at the original variables `social`

, `outdoor`

and `conservative`

for the misclassified people, and try to find out what was unusual about them. But I think now would be an excellent place for me to stop.

\(\blacksquare\)

## 35.12 Observing children with ADHD

A number of children with ADHD were observed by their mother or their father (only one parent observed each child). Each parent was asked to rate occurrences of behaviours of four different types, labelled `q1`

through `q4`

in the data set. Also recorded was the identity of the parent doing the observation for each child: 1 is father, 2 is mother.

Can we tell (without looking at the `parent`

column) which parent is doing the observation? Research suggests that rating the degree of impairment in different categories depends on who is doing the rating: for example, mothers may feel that a child has difficulty sitting still, while fathers, who might do more observing of a child at play, might think of such a child as simply being “active” or “just being a kid”. The data are in link.

- Read in the data and confirm that you have four ratings and a column labelling the parent who made each observation.

Solution

As ever:

```
my_url <- "https://raw.githubusercontent.com/nxskok/datafiles/master/adhd-parents.txt"
adhd <- read_delim(my_url, " ")
```

```
Rows: 29 Columns: 5
── Column specification ────────────────────────────────────────────────────────
Delimiter: " "
chr (1): parent
dbl (4): q1, q2, q3, q4
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
```

`adhd`

Yes, exactly that.

\(\blacksquare\)

- Run a suitable discriminant analysis and display the output.

Solution

This is as before:

```
adhd.1 <- lda(parent ~ q1 + q2 + q3 + q4, data = adhd)
adhd.1
```

```
Call:
lda(parent ~ q1 + q2 + q3 + q4, data = adhd)
Prior probabilities of groups:
father mother
0.1724138 0.8275862
Group means:
q1 q2 q3 q4
father 1.800 1.000000 1.800000 1.800
mother 2.375 2.791667 1.958333 1.625
Coefficients of linear discriminants:
LD1
q1 -0.3223454
q2 2.3219448
q3 0.1411360
q4 0.1884613
```

\(\blacksquare\)

- Which behaviour item or items seem to be most helpful at distinguishing the parent making the observations? Explain briefly.

Solution

Look at the Coefficients of Linear Discriminants. The coefficient of `q2`

, 2.32, is much larger in size than the others, so it’s really `q2`

that distinguishes mothers and fathers. Note also that the group means for fathers and mothers are fairly close on all the items except for `q2`

, which are a long way apart. So that’s another hint that it might be `q2`

that makes the difference. But that might be deceiving: one of the other `q`

s, even though the means are close for mothers and fathers, might actually do a good job of distinguishing mothers from fathers, because it has a small SD overall.
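To make that last point concrete, here is a toy calculation (the gaps and SDs below are invented, not from the `adhd` data): what matters for discrimination is the gap between the group means *relative to* the spread, not the raw gap.

```r
# Hypothetical illustration: a small mean gap with a tiny SD can
# separate groups better than a big gap with a big SD
gap_big_sd   <- abs(28 - 20) / 10      # big gap, big spread: 0.8 SDs apart
gap_small_sd <- abs(2.1 - 1.9) / 0.05  # tiny gap, tiny spread: 4 SDs apart
c(gap_big_sd, gap_small_sd)
```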

\(\blacksquare\)

- Obtain the predictions from the
`lda`

, and make a suitable plot of the discriminant scores, bearing in mind that you only have one `LD`

. Do you think there will be any misclassifications? Explain briefly.

Solution

The prediction is the obvious thing. I take a quick look at it (using `glimpse`

), but only because I feel like it:

```
adhd.2 <- predict(adhd.1)
glimpse(adhd.2)
```

```
List of 3
$ class : Factor w/ 2 levels "father","mother": 1 2 1 2 2 2 2 2 2 2 ...
$ posterior: num [1:29, 1:2] 9.98e-01 5.57e-06 9.98e-01 4.97e-02 4.10e-05 ...
..- attr(*, "dimnames")=List of 2
.. ..$ : chr [1:29] "1" "2" "3" "4" ...
.. ..$ : chr [1:2] "father" "mother"
$ x : num [1:29, 1] -3.327 1.357 -3.327 -0.95 0.854 ...
..- attr(*, "dimnames")=List of 2
.. ..$ : chr [1:29] "1" "2" "3" "4" ...
.. ..$ : chr "LD1"
```

The discriminant scores are in the thing called `x`

in there. There is only `LD1`

(only two groups, mothers and fathers), so the right way to plot it is against the true groups, e.g. by a boxplot, first making a data frame containing what you need:

```
d <- cbind(adhd, adhd.2)
head(d)
```

`ggplot(d, aes(x = parent, y = LD1)) + geom_boxplot()`

The fathers look to be a very compact group with `LD1`

score around \(-3\), so I don’t foresee any problems there. The mothers, on the other hand, have outliers: there is one with `LD1`

score beyond \(-3\) that will certainly be mistaken for a father. There are a couple of other unusual `LD1`

scores among the mothers, but a rule like “anything above \(-2\) is called a mother, anything below is called a father” will get these two right. So I expect that the one very low mother will get misclassified, but that’s the only one.

\(\blacksquare\)

- Obtain the predicted group memberships and make a table of actual vs. predicted. Were there any misclassifications? Explain briefly.

Solution

Use the predictions from the previous part, and the observed `parent`

values from the original data frame. Then use either `table`

or `tidyverse`

to summarize.

`with(d, table(obs = parent, pred = class))`

```
pred
obs father mother
father 5 0
mother 1 23
```

Or,

`d %>% count(parent, class)`

or

```
d %>%
  count(parent, class) %>%
  pivot_wider(names_from = class, values_from = n, values_fill = list(n = 0))
```

One of the mothers got classified as a father (evidently that one with a very negative `LD1`

score), but everything else is correct.

This time, by “explain briefly” I mean something like “tell me how you know there are or are not misclassifications”, or “describe any misclassifications that occur” or something like that.

Extra: I was curious — what is it about that one mother that caused her to get misclassified? (I didn’t ask you to think further about this, but in case you are curious as well.)

First, which mother *was* it? Let’s begin by adding the predicted classification to the data frame, and then we can query it by asking to see only the rows where the actual parent and the predicted parent were different. I’m also going to create a column `id`

that will give us the row of the *original* data frame:

```
d %>%
  mutate(id = row_number()) %>%
  filter(parent != class)
```

It was the original row 17. So what was unusual about this? We know from earlier that behaviour `q2`

was the one that generally distinguished mothers from fathers, so maybe we should find the mean and SD of scores for mothers and fathers on `q2`

:

`adhd %>% group_by(parent) %>% summarize(m2 = mean(q2), s2 = sd(q2))`

The fathers’ scores on `q2`

were *all* 1, but the mothers’ scores on `q2`

were on average much higher. So it’s not really a surprise that this mother was mistaken for a father.

\(\blacksquare\)

- Re-run the discriminant analysis using cross-validation, and again obtain a table of actual and predicted parents. Is the pattern of misclassification different from before? Hints: (i) Bear in mind that there is no
`predict`

step this time, because the cross-validation output includes predictions; (ii) use a different name for the predictions this time because we are going to do a comparison in a moment.

Solution

So, this, with different name:

```
adhd.3 <- lda(parent ~ q1 + q2 + q3 + q4, data = adhd, CV = T)
dd <- cbind(adhd, class = adhd.3$class, posterior = adhd.3$posterior)
with(dd, table(parent, class))
```

```
class
parent father mother
father 5 0
mother 1 23
```

It’s exactly the same pattern of misclassification. (In fact, it’s exactly the same mother being misclassified as a father.)

This one is the same *not* because of having lots of data. In fact, as you see below, having a small data set makes quite a bit of difference to the posterior probabilities (where they are not close to 1 or 0), but the decisions about whether the parents are a mother or a father are clear-cut enough that none of *those* change. Even though (some of) the posterior probabilities are noticeably changed, which one is the bigger has not changed at all.

\(\blacksquare\)

- Display the original data (that you read in from the data file) side by side with two sets of posterior probabilities: the ones that you obtained with
`predict`

before, and the ones from the cross-validated analysis. Comment briefly on whether the two sets of posterior probabilities are similar. Hints: (i) use `data.frame`

rather than `cbind`

, for reasons that I explain elsewhere; (ii) round the posterior probabilities to 3 decimals before you display them. There are only 29 rows, so look at them all. I am going to add the `LD1`

scores to my output and sort by that, but you don’t need to. (This is for something I am going to add later.)

Solution

We have two data frames, `d`

and `dd`

^{12}

that respectively contain everything from the (original) `lda`

output and the cross-validated output. Let’s glue them together, look at what we have, and then pull out what we need:

```
all <- data.frame(d, dd)
head(all)
```

The ones with a 1 on the end are the cross-validated ones. We need the posterior probabilities, rounded, and they need to have shorter names:

```
all %>%
  select(parent, starts_with("posterior"), LD1) %>%
  mutate(across(starts_with("posterior"), \(x) round(x, 3))) %>%
  rename_with(
    ~ str_replace(., "posterior", "p"),
    starts_with("posterior")
  ) %>%
  arrange(LD1)
```

The `rename_with`

changes the names of the columns that start with `posterior`

to start with `p`

instead (shorter). I learned about this today (having wondered whether it existed or not), and it took about three goes for me to get it right.^{13}

The first column is the actual parent; the other five columns are: the posterior probabilities from before, for father and for mother (two columns); the posterior probabilities from cross-validation for father and for mother (two more columns); and the LD1 scores from before, sorted into order. You might have these the other way around from me, but in any case you ought to make it clear which is which. I included the `LD1`

scores for my discussion below; you don’t need to.

Are the two sets of posterior probabilities similar? Only kinda. The ones at the top and bottom of the list are without doubt respectively fathers at the top of the list (top 5 rows on my sorted output, except that one of those is actually a mother), or mothers at the bottom, from row 10 down. But for rows 6 through 9, the posterior probabilities are not that similar. The most dissimilar ones are in row 4, where the regular `lda`

gives a posterior probability of 0.050 that the parent is a father, but under cross-validation that goes all the way up to 0.236. I think this is one of those mothers that is a bit like a father: her score on `q2`

was only 2, compared to 3 for most of the mothers. If you take out this mother, as cross-validation does, there are noticeably fewer `q2=2`

mothers left, so the observation looks more like a father than it would otherwise.

mothers left, so the observation looks more like a father than it would otherwise.

\(\blacksquare\)

- Row 17 of your (original) data frame above, row 5 of the output in the previous part, is the mother that was misclassified as a father. Why is it that the cross-validated posterior probabilities are 1 and 0, while the previous posterior probabilities are a bit less than 1 and a bit more than 0?

Solution

In making the classification, the non-cross-validated procedure uses all the data, so that parent #17 suggests that the mothers are very variable on `q2`

, so it is conceivable (though still unlikely) that this parent actually is a mother. Under cross-validation, however, parent #17 is *omitted*. This mother is nothing like any of the other mothers, or, to put it another way, the remaining mothers as a group are very far away from this one, so #17 doesn’t look like a mother *at all*.

\(\blacksquare\)

- Find the parents where the cross-validated posterior probability of being a father is “non-trivial”: that is, not close to zero and not close to 1. (You will have to make a judgement about what “close to zero or 1” means for you.) What do these parents have in common, all of them or most of them?

Solution

Let’s add something to the output we had before: the original scores on `q1`

through `q4`

:

```
all %>%
  select(q1:q4, parent, starts_with("posterior"), LD1) %>%
  mutate(across(starts_with("posterior"), \(x) round(x, 3))) %>%
  rename_with(
    \(y) str_replace(y, "posterior", "p"),
    starts_with("posterior")) %>%
  arrange(LD1)
```

To my mind, the “non-trivial” posterior probabilities are in rows 5 through 9. (You might have drawn the line in a different place.) These are the ones where there was some doubt, though maybe only a little, about which parent actually gave the ratings. For three of these, the parent (that was actually a mother) gave a rating of 2 on `q2`

. These were the only 2’s on `q2`

. The others were easy to call: “mother” if 3 and “father” if 1, and you’d get them all right except for that outlying mother. The clue in looking at `q2`

was that we found earlier that `LD1`

contained mostly `q2`

, so that it was mainly `q2`

that separated the fathers and mothers. If you found something else that the “non-trivial” rows had in common, that is good too, but I think looking at `q2`

is your quickest route to an answer. (`q1=1`

picks out some of these, but not all of them.) This is really the same kind of issue as we discussed when comparing the posterior probabilities for `lda`

and cross-validation above: there were only a few parents with `q2=2`

, so the effect there is that under cross-validation, there are even fewer when you take one of them out.

\(\blacksquare\)

## 35.13 Growing corn

A new type of corn seed has been developed. The people developing it want to know if the type of soil the seed is planted in has an impact on how well the seed performs, and if so, what kind of impact. Three outcome measures were used: the yield of corn produced (from a fixed amount of seed), the amount of water needed, and the amount of herbicide needed. The data are in link. 32 fields were planted with the seed, 8 fields with each soil type.

- Read in the data and verify that you have 32 observations with the correct variables.

Solution

The usual:

```
my_url <- "https://raw.githubusercontent.com/nxskok/datafiles/master/cornseed.csv"
cornseed <- read_csv(my_url)
```

```
Rows: 32 Columns: 5
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (1): soil
dbl (4): field, yield, water, herbicide
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
```

`cornseed`

We have 32 rows; we have a categorical soil type, three numerical columns containing the yield, water and herbicide values, and we also have a label for each of the 32 fields (which is actually a number, but we don’t have to worry about that, since we won’t be using `field` for anything).

\(\blacksquare\)

- Run a multivariate analysis of variance to see whether the type of soil has any effect on any of the variables. What do you conclude from it?

Solution

The usual thing: create the response, use `manova` (or `Manova` from `car` if you like, but it’s not necessary):

```
response <- with(cornseed, cbind(yield, water, herbicide))
cornseed.1 <- manova(response ~ soil, data = cornseed)
summary(cornseed.1)
```

```
Df Pillai approx F num Df den Df Pr(>F)
soil 3 0.5345 2.0234 9 84 0.04641 *
Residuals 28
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```

With a P-value (just) less than 0.05, soil type has some effect on the response variables: that is, it affects one or more of the three responses, or some combination of them. ANOVA conclusions are usually vague, and MANOVA conclusions are vaguer than most. We will try to improve on this. But with an only-just-significant P-value, we should not be expecting miracles.

We ought to check Box’s M test:

`summary(BoxM(response, cornseed$soil))`

```
Box's M Test
Chi-Squared Value = 12.07432 , df = 18 and p-value: 0.843
```

No problems with unequal variances or correlations, at least.

Here and below, `field` is neither a response variable nor an explanatory variable; it is an experimental unit, so `field` acts as an ID rather than anything else. So `field` should not be part of any of the analyses; if it did appear, the only way it could is as a factor, for example if this was somehow a repeated measures analysis over the three response variables. In that case, `lmer`, if you were going that way, would use `field` as a random effect.

The variables to include are the yield, water and herbicide as measured response variables, and soil type, as the categorical explanatory variable. (For the discriminant analysis, these get turned around: the grouping variable `soil` acts like a response and the others act as explanatory.)

\(\blacksquare\)

- Run a discriminant analysis on these data, “predicting” soil type from the three response variables. Display the results.

Solution

```
cornseed.2 <- lda(soil ~ yield + water + herbicide, data = cornseed)
cornseed.2
```

```
Call:
lda(soil ~ yield + water + herbicide, data = cornseed)
Prior probabilities of groups:
clay loam salty sandy
0.25 0.25 0.25 0.25
Group means:
yield water herbicide
clay 58.8375 33.0875 4.0875
loam 69.7125 32.7375 6.3875
salty 55.3125 30.6375 3.8625
sandy 62.5750 28.2000 4.3500
Coefficients of linear discriminants:
LD1 LD2 LD3
yield -0.08074845 -0.02081174 -0.04822432
water 0.03759961 0.09598577 -0.03231897
herbicide -0.50654017 0.06979662 0.27281743
Proportion of trace:
LD1 LD2 LD3
0.9487 0.0456 0.0057
```

No `field` in here, for reasons discussed above. (I’m not even sure how you *can* run a discriminant analysis with a factor variable on the right of the squiggle.) The fields were numbered by soil type:

`cornseed %>% select(field, soil)`

so evidently if you know the field number you can guess the field type, but we didn’t care about that: we cared about whether you can distinguish the fields by yield, water, herbicide or combination thereof.

\(\blacksquare\)

- * Which linear discriminants seem to be worth paying attention to? Why did you get three linear discriminants? Explain briefly.

Solution

Look for “proportion of trace” in the output.

The first one is *way* bigger than the others, which says that the first linear discriminant is way more important (at separating the groups) than either of the other two.

As to why we got three: there are 3 variables and 4 groups (soil types), and the smaller of 3 and \(4-1\) is 3.
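As a quick arithmetic check of this counting rule (a sketch; the counts come from this dataset):

```r
n_vars <- 3    # yield, water, herbicide
n_groups <- 4  # clay, loam, salty, sandy
min(n_vars, n_groups - 1)  # 3 linear discriminants
```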

\(\blacksquare\)

- Which response variables do the important linear discriminants depend on? Answer this by extracting something from your discriminant analysis output.

Solution

The table “coefficients of linear discriminants”. We said earlier that the only important discriminant is `LD1`. On that, the only notably non-zero coefficient is for `herbicide`; the ones for `yield` and `water` are close to zero. That is to say, the effects of the soil types play out through herbicide and not either of the other two variables.

I didn’t ask you to, but you could check this by seeing how `herbicide` differs according to soil type:

`ggplot(cornseed, aes(x = soil, y = herbicide)) + geom_boxplot()`

The fields in `loam` soil needed more herbicide than the others.

Or by `water`:

`ggplot(cornseed, aes(x = soil, y = water)) + geom_boxplot()`

There isn’t much difference in the amount of water needed between any of the fields, no matter what soil type.

This confirms that `water` is not distinguished by soil type, while `herbicide` is (at least to some extent).

\(\blacksquare\)

- Obtain predictions for the discriminant analysis. (You don’t need to do anything with them yet.)

Solution

Just this, therefore:

`cornseed.pred <- predict(cornseed.2)`

\(\blacksquare\)

- Plot the first two discriminant scores against each other, coloured by soil type. You’ll have to start by making a data frame containing what you need.

Solution

I changed my mind from the past about how to do this. I make a big data frame out of the data and predictions (with `cbind`) and go from there:

```
d <- cbind(cornseed, cornseed.pred)
head(d)
```

Then we use this as input to `ggplot`:

`ggplot(d, aes(x = x.LD1, y = x.LD2, colour = soil)) + geom_point()`

\(\blacksquare\)

- On your plot that you just made, explain briefly how `LD1` distinguishes at least one of the soil types.

Solution

Find a soil type that is typically high (or low or average) on LD1. Any one or more of these will do: loam soils are typically high on LD1, clay soils or salty soils are typically low on LD1; sandy soils are typically average on LD1. (There are exceptions, but I’m looking for “typically”.)

\(\blacksquare\)

- On your plot, does `LD2` appear to do anything to separate the groups? Is this surprising given your earlier findings? Explain briefly.

Solution

All the soil groups appear to go to about the full height of the plot: that is to say, none of the groups appear to be especially at the top or the bottom. That means that `LD2` does not separate the groups at all.

Back in part (here), we said that the first linear discriminant is way more important than either of the other two, and here we see what that means: `LD2` does nothing to separate the groups. So it’s not a surprising finding at all.

I thought earlier about asking you to plot only the first linear discriminant, and now we see why: only the first one separates the groups. If you wanted to do that, you could make a boxplot of the discriminant scores by `soil` group, thus:

`ggplot(d, aes(x = soil, y = x.LD1)) + geom_boxplot()`

This says more or less the same thing as your plot of `LD1` and `LD2`: `loam` has the highest `LD1` score, `sandy` is about in the middle, and `clay` and `salty` have typically negative `LD1` scores, similar to each other, though there is one outlying `salty` that looks a lot more like a `loam`.

\(\blacksquare\)

- Make a table of actual and predicted `soil` group. Which soil type was classified correctly the most often?

Solution

`with(d, table(obs = soil, pred = class))`

```
pred
obs clay loam salty sandy
clay 3 0 3 2
loam 0 6 0 2
salty 1 1 5 1
sandy 2 1 1 4
```

Or, the `tidyverse` way, which is below.

There were 8 fields of each soil type. The soil type that has the most of its fields classified correctly (based on the values of the response variables) has the biggest number down the diagonal of the table: looking at 3, 6, 5 and 4, we see that the `loam` soil type had the most of its fields classified correctly, so this was the most distinct from the others. (We also saw this on the plot of `LD1` vs. `LD2`: the `loam` fields were all over on the right.)

This was easier because we had the same number of fields of each type. If we didn’t have that, the right way to go then would be to work out *row* percentages: “out of the fields that were actually sandy, what percent of them got classified as sandy”, and so on.
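A hedged sketch of that row-percentage idea, using base R’s `prop.table` on a small made-up confusion table (the real one would come from `with(d, table(soil, class))`):

```r
# made-up confusion table with unequal group sizes (rows = actual groups)
tab <- as.table(rbind(c(6, 2), c(3, 9)))
dimnames(tab) <- list(obs = c("salty", "sandy"), pred = c("salty", "sandy"))
round(100 * prop.table(tab, margin = 1), 1)  # margin = 1: each row sums to 100
```

The first row reads “of the fields that were actually salty, 75% were classified salty and 25% sandy”, which is directly comparable between groups of different sizes.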

This is not a perfect classification, though, which is about what you would expect from the soil types being intermingled on the plot of `LD1` vs. `LD2`. If you look at the table, `salty` and `sandy` are fairly distinct also, but `clay` is often confused with both of them. On the plot of `LD1` and `LD2`, `salty` is generally to the left of `sandy`, but `clay` is mixed up with them both. The tidyverse way of doing this is equally good. This is the tidied-up way:

```
d %>%
  count(soil, class) %>%
  pivot_wider(names_from = class, values_from = n, values_fill = 0)
```

Six out of eight `loam`s were correctly classified, which is better than anything else.

Extra: we can calculate misclassification rates, first overall, which is easier:

```
d %>%
  count(soil, class) %>%
  mutate(soil_stat = ifelse(soil == class, "correct", "wrong")) %>%
  count(soil_stat, wt = n)
```

```
d %>%
  count(soil, class) %>%
  mutate(soil_stat = ifelse(soil == class, "correct", "wrong")) %>%
  count(soil_stat, wt = n) %>%
  mutate(prop = n / sum(n))
```

Note the use of `wt` on the second `count` to count the number of *observations* from the first `count`, not the number of *rows*.
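A tiny sketch of what `wt` does, with made-up frequencies:

```r
library(dplyr)
# three rows of pre-counted frequencies
counts <- tibble::tibble(stat = c("correct", "correct", "wrong"),
                         n = c(5, 4, 3))
# wt = n totals the n column within each stat value, rather than
# counting rows (which would give 2 and 1)
counts %>% count(stat, wt = n)  # correct 9, wrong 3
```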

This shows that 44% of the soil types were misclassified, which sounds awful, but is actually not so bad, considering. Bear in mind that if you were just guessing, you’d get 75% of them wrong, so getting 44% wrong is quite a bit better than that. The variables (especially `herbicide`) are at least somewhat informative about soil type; it’s better to know them than not to.
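To make that comparison concrete (the 14 comes from adding up the off-diagonal entries of the confusion table above):

```r
# with 4 equally likely soil types, random guessing is wrong 3/4 of the time
p_wrong_guessing <- 1 - 1 / 4  # 0.75
p_wrong_lda <- 14 / 32         # 14 of the 32 fields were misclassified
c(guessing = p_wrong_guessing, lda = p_wrong_lda)  # 0.75 vs. 0.4375
```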

Or do it by actual soil type:

```
d %>%
  count(soil, class) %>%
  group_by(soil) %>%
  mutate(soil_stat = ifelse(soil == class, "correct", "wrong")) %>%
  count(soil_stat, wt = n)
```

```
d %>%
  count(soil, class) %>%
  group_by(soil) %>%
  mutate(soil_stat = ifelse(soil == class, "correct", "wrong")) %>%
  count(soil_stat, wt = n) %>%
  mutate(prop = n / sum(n)) %>%
  select(-n) %>%
  pivot_wider(names_from = soil_stat, values_from = prop)
```

Loam soil was the easiest to get right, and clay was easiest to get wrong. However, these proportions were each based on only eight observations, so it’s probably wise *not* to say that loam is *always* easiest to get right.

I didn’t have you look at posterior probabilities here.^{14} With 32 fields, this is rather a lot to list them all, but what we can do is to look at the ones that were misclassified (the true soil type differs from the predicted soil type). Before that, though, we need to make a data frame with the stuff in it that we want to look at. And before *that*, I want to round the posterior probabilities to a small number of decimals.

Then, we can fire away with this:

```
d %>%
  mutate(across(starts_with("posterior"),
                \(post) round(post, 3))) %>%
  mutate(row = row_number()) -> dd
dd %>% filter(soil != class)
```

Most of the posterior probabilities are neither especially small nor especially large, which adds to the impression that things are really rather uncertain. For example, field 8 could have been either loam (0.352) or sandy (0.373). There was one field that was actually salty but looked like a loam one (with `LD1` score around 2); this is field 20, that needed a lot of herbicide; it was rated to have an 82% chance of being loam and only 1% chance of salty.

Let’s remind ourselves of why we were doing this: the MANOVA was significant, so at least some of the fields were different on some of the variables from some of the others. What we found by doing the discriminant analysis was that only the first discriminant was of any value in distinguishing the soil types by the variables we measured, and *that* was mostly `herbicide`. So the major effect that soil type had was on the amount of herbicide needed, with the loam soils needing most.

I wanted to finish with one more thing, which was to look again at the soils that were actually loam:

```
dd %>%
  filter(soil == "loam") %>%
  select(soil, yield, water, herbicide, class, starts_with("posterior"))
```

Fields 7 and 8 could have been pretty much any type of soil; `sandy` came out with the highest posterior probability, so that’s what they were predicted (wrongly) to be. Some of the fields, 1, 3 and 5, were clearly (and correctly) loam. For 1 and 5, you can clearly see that this is because `herbicide` was high, but field 3 is more of a mystery. For this field, `herbicide` is *not* high, so one or more of the other variables must be pointing towards `loam`.

We can obtain predicted `LD1` scores for various combinations of “typical” values of the response variables and see what has what effect on `LD1`:

`summary(cornseed)`

```
field soil yield water
Min. : 1.00 Length:32 Min. :45.00 Min. :14.50
1st Qu.: 8.75 Class :character 1st Qu.:50.58 1st Qu.:25.75
Median :16.50 Mode :character Median :61.40 Median :29.60
Mean :16.50 Mean :61.61 Mean :31.17
3rd Qu.:24.25 3rd Qu.:67.00 3rd Qu.:36.83
Max. :32.00 Max. :96.10 Max. :54.20
herbicide
Min. : 1.100
1st Qu.: 3.075
Median : 4.750
Mean : 4.672
3rd Qu.: 5.825
Max. :11.700
```

The problem is that the variables have different spreads. Let’s do some predictions (ie. calculations) of LD1 score for combinations of quartiles of our response variables. I like quartiles because these are “representative” values of the variables, typical of how far up and down they go. This process is one you’ve seen before:

```
# datagrid() is from the marginaleffects package
yields <- c(51, 67)
waters <- c(26, 37)
herbicides <- c(3, 6)
new <- datagrid(model = cornseed.2, yield = yields,
                water = waters, herbicide = herbicides)
pred <- predict(cornseed.2, new)
cbind(new, pred$x) %>% arrange(desc(LD1))
```

I arranged the predicted LD1 scores in descending order, so the most loam-like combinations are at the top. The top two combinations look like loam; they both have high `herbicide`, as we figured before. But they also have high `yield`. That might go some way towards explaining why field 3, with its non-high `herbicide`, was confidently predicted to be `loam`:

`cornseed %>% filter(field == 3)`

This has a very high `yield`, and *that* is what is making us (correctly) think it is `loam`.

I suddenly remembered that I hadn’t done a biplot of this one, which I could, since it’s a discriminant analysis:

`ggbiplot(cornseed.2, groups = cornseed$soil)`

This shows the dominant influence of `herbicide` on LD1 score (more herbicide is more positive), and that `water` has nothing to say (in terms of distinguishing soil types) and `yield` has not much to say, their arrows being short. That observation with a non-high `herbicide` that was predicted to be `loam` had the highest `yield` of all, so even the small influence of `yield` on `LD1` made a big difference here.

\(\blacksquare\)

## 35.14 Understanding athletes’ height, weight, sport and gender

On a previous assignment, we used MANOVA on the athletes data to demonstrate that there was a significant relationship between the combination of the athletes’ height and weight, with the sport they play and the athlete’s gender. The problem with MANOVA is that it doesn’t give any information about the *kind* of relationship. To understand that, we need to do discriminant analysis, which is the purpose of this question.

The data can be found at link.

- Once again, read in and display (some of) the data, bearing in mind that the data values are separated by *tabs*. (This ought to be a free two marks.)

Solution

Nothing new here:

```
my_url <- "https://raw.githubusercontent.com/nxskok/datafiles/master/ais.txt"
athletes <- read_tsv(my_url)
```

```
Rows: 202 Columns: 13
── Column specification ────────────────────────────────────────────────────────
Delimiter: "\t"
chr (2): Sex, Sport
dbl (11): RCC, WCC, Hc, Hg, Ferr, BMI, SSF, %Bfat, LBM, Ht, Wt
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
```

`athletes`

\(\blacksquare\)

- Use `unite` to make a new column in your data frame which contains the sport-gender *combination*. Display it. (You might like to display only a few columns so that it is clear that you did the right thing.) Hint: you’ve seen `unite` in the peanuts example in class.

Solution

The columns to combine are called `Sport` and `Sex`, with Capital Letters. The syntax for `unite` is that you give the name of the new combo column first, and then the names of the columns you want to combine, either by listing them or by using a select-helper. They will be separated by an underscore by default, which is usually easiest to handle.^{15} In `unite`, you can group the columns to “unite” with `c()`, as in class, or not, as here. Either way is good.^{16} We’ll be using height and weight in the analysis to come, so I decided to display just those:

```
athletes %>%
  unite(combo, Sport, Sex) -> athletesc
athletesc %>% select(combo, Ht, Wt)
```

I gave the data frame a new name, since I might want to come back to the original later. Also, displaying only those columns gives more width for the display of my `combo`, so that I can be sure I got it right.

Extra: there is another column, `SSF`, that begins with S, so the select-helper thing is not so obviously helpful here. But the two columns we want start with S followed by either e or p, so we could do this:

```
athletes %>%
  unite(combo, matches("^S(e|p)")) %>%
  select(combo, Ht, Wt)
```

The `matches` takes a so-called regular expression. This one says “starting at the beginning of the column name, find an uppercase S followed by either a lowercase e or a lowercase p”. This picks out the columns and only the columns we want. In the opposite order, though (either order is fine).
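A quick way to convince yourself which column names the pattern matches, using base R’s `grepl`:

```r
nms <- c("Sex", "Sport", "SSF", "Ht", "Wt")
grepl("^S(e|p)", nms)  # TRUE only for Sex and Sport; SSF has another S after its S
```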

I have a feeling we can also take advantage of the fact that the two columns we want to `unite` are the only two text ones:

```
athletes %>%
  unite(combo, where(is.character))
```

I wasn’t expecting that to work!

\(\blacksquare\)

- Run a discriminant analysis “predicting” sport-gender combo from height and weight. Display the results. (No comment needed yet.)

Solution

That would be this. I’m having my familiar trouble with names:

`combo.1 <- lda(combo ~ Ht + Wt, data = athletesc)`

If you used a new name for the data frame with the sport-gender combinations in it, use that new name here.

The output:

`combo.1`

```
Call:
lda(combo ~ Ht + Wt, data = athletesc)
Prior probabilities of groups:
BBall_female BBall_male Field_female Field_male Gym_female
0.06435644 0.05940594 0.03465347 0.05940594 0.01980198
Netball_female Row_female Row_male Swim_female Swim_male
0.11386139 0.10891089 0.07425743 0.04455446 0.06435644
T400m_female T400m_male Tennis_female Tennis_male TSprnt_female
0.05445545 0.08910891 0.03465347 0.01980198 0.01980198
TSprnt_male WPolo_male
0.05445545 0.08415842
Group means:
Ht Wt
BBall_female 182.2692 71.33077
BBall_male 195.5833 88.92500
Field_female 172.5857 80.04286
Field_male 185.2750 95.76250
Gym_female 153.4250 43.62500
Netball_female 176.0870 69.59348
Row_female 178.8591 72.90000
Row_male 187.5333 86.80667
Swim_female 173.1778 65.73333
Swim_male 185.6462 81.66154
T400m_female 169.3364 57.23636
T400m_male 179.1889 68.20833
Tennis_female 168.5714 58.22857
Tennis_male 183.9500 75.40000
TSprnt_female 170.4750 59.72500
TSprnt_male 178.5364 75.79091
WPolo_male 188.2235 86.72941
Coefficients of linear discriminants:
LD1 LD2
Ht 0.08898971 -0.1888615
Wt 0.06825230 0.1305246
Proportion of trace:
LD1 LD2
0.7877 0.2123
```

I comment here that there are two linear discriminants because there are two variables (height and weight) and actually 17 groups (not quite \(2\times 10\) because some sports are played by athletes of only one gender). The smaller of 2 and \(17-1\) is 2. (I often ask about this, but am choosing not to here.)

\(\blacksquare\)

- What kind of height and weight would make an athlete have a large (positive) score on `LD1`? Explain briefly.

Solution

The Coefficients of Linear Discriminants for `LD1` are both positive, so an athlete with a large positive score on `LD1` has a large height and weight: that is to say, they are tall and heavy.

\(\blacksquare\)

- Make a guess at the sport-gender combination that has the *highest* score on LD1. Why did you choose the combination you did?

Solution

I could have made you guess the *smallest* score on LD1, but that would have been too easy (female gymnasts). For this one, you want a sport-gender combination that is typically tall and heavy, and you can look in the table of Group Means to help you find a candidate group. I think the two best guesses are male basketball players (tallest and nearly the heaviest) and male field athletes (heaviest and among the group of athletes that are second-tallest behind the male basketball players). I don’t so much mind what you guess, as long as you make a sensible call about a group that is reasonably tall and reasonably heavy (or, I suppose, that matches with what you said in the previous part, whatever that was).

\(\blacksquare\)

- What combination of height and weight would make an athlete have a *small* (that is, very negative) score on LD2? Explain briefly.

Solution

Going back to the Coefficients of Linear Discriminants, the coefficient for Height is negative, and the one for Weight is positive. What will make an athlete come out small (very negative) on this is if they have a *large* height and a *small* weight. To clarify your thinking on this, think of the heights and weights as being standardized, so that a big one will be positive and a small one will be negative. To make `LD2` very negative, you want a “plus” height to multiply the minus sign, and a “minus” weight multiplying the plus sign.

Extra: what is happening here is that `LD1` gives the most important way in which the groups differ, and `LD2` the next-most important. There is generally a positive correlation between height and weight (taller athletes are generally heavier), so the most important “dimension” is the big-small one with tall heavy athletes at one end and short light athletes at the other. The `Proportion of trace` in the output says that `LD1` is definitely more important, in terms of separating the groups, than `LD2` is, but the latter still has *some* value.
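To see the signs at work, a small sketch using the `LD2` coefficients from the output above (the +10/−10 athlete is made up for illustration):

```r
ld2 <- c(Ht = -0.1888615, Wt = 0.1305246)  # LD2 coefficients from the lda output
athlete <- c(Ht = 10, Wt = -10)  # 10 cm taller, 10 kg lighter than average
sum(ld2 * athlete)               # about -3.19: tall and light comes out very negative
```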

\(\blacksquare\)

- Obtain predictions for the discriminant analysis, and use these to make a plot of `LD1` score against `LD2` score, with the individual athletes distinguished by what sport they play and gender they are. (You can use colour to distinguish them, or you can use shapes. If you want to go the latter way, there are clues in my solutions to the MANOVA question about these athletes.)

Solution

The prediction part is only one step:

`p <- predict(combo.1)`

One point for this.

This, in case you are wondering, is obtaining predicted group membership and LD scores for the original data, that is, for our 202 athletes.

I prefer (no obligation) to take a look at what I have. My `p` is actually a `list`:

`class(p)`

`[1] "list"`

`glimpse(p)`

```
List of 3
$ class : Factor w/ 17 levels "BBall_female",..: 12 6 6 6 7 6 6 6 6 6 ...
$ posterior: num [1:202, 1:17] 0.1235 0.0493 0.084 0.0282 0.1538 ...
..- attr(*, "dimnames")=List of 2
.. ..$ : chr [1:202] "1" "2" "3" "4" ...
.. ..$ : chr [1:17] "BBall_female" "BBall_male" "Field_female" "Field_male" ...
$ x : num [1:202, 1:2] -1.325 -1.487 -0.96 -1.885 0.114 ...
..- attr(*, "dimnames")=List of 2
.. ..$ : chr [1:202] "1" "2" "3" "4" ...
.. ..$ : chr [1:2] "LD1" "LD2"
```

Our standard procedure is to `cbind` the predictions together with the original data (including the combo), and get a huge data frame (in this case):

```
d <- cbind(athletesc, p)
head(d)
```

And so, to the graph:

`ggplot(d, aes(x = x.LD1, y = x.LD2, colour = combo)) + geom_point()`

If you can distinguish seventeen different colours, your eyes are better than mine! You might prefer to use seventeen different shapes, although I wonder how much better that will be:

```
ggplot(d, aes(x = x.LD1, y = x.LD2, shape = combo)) + geom_point() +
scale_shape_manual(values = 1:17)
```

You have to do something special to get as many as seventeen shapes. This idea came from the MANOVA question in the last assignment.

Or even this:

```
ggplot(d, aes(x = x.LD1, y = x.LD2, shape = combo, colour = combo)) + geom_point() +
scale_shape_manual(values = 1:17)
```

Perhaps having colours *and* shapes makes the combos easier to distinguish. We’re beginning to stray onto the boundary between statistics and aesthetics here!

Extra: earlier, I asked you to guess which group(s) of athletes had a high (positive) score on LD1. These are the ones on the right side of this plot: male basketball players bottom right and male field athletes top right. Was that what you guessed? What about the other guesses you might have made?

\(\blacksquare\)

- Look on your graph for the four athletes with the smallest (most negative) scores on `LD2`. What do they have in common? Does this make sense, given your answer to part (here)? Explain briefly.

Solution

These are the four athletes at the bottom of the plot. If you can distinguish the colours, two of these are red and two of them are orange, so they are all basketball players (two male and two female). If you plotted the shapes, and you used the same shapes I did, two of them are circles and the other two are upward-facing triangles, leading you to the same conclusion. (You could also denote each combo by a letter and plot with those letters, as per the solutions to the last assignment.) Back in part (here), I said that what would make an athlete come out very negative on `LD2` is if they were *tall* and *not heavy*. This is the stereotypical description of a basketball player, so it makes perfect sense to me.

Extra: some basketball players are tall and *heavier*; these are the ones on the right of the plot, with a larger `LD1` score, to reflect that they are both tall and heavy, but with an `LD2` score closer to zero, reflecting that, given how tall they are, their weight is about what you’d expect. LD2 is really saying something like “weight relative to height”, with someone at the top of the picture being unusually heavy and someone at the bottom unusually light.

\(\blacksquare\)

- Obtain a (very large) square table, or a (very long) table with frequencies, of actual and predicted sport-gender combinations. You will probably have to make the square table very small to fit it on the page. For that, displaying the columns in two or more sets is OK (for example, six columns and all the rows, six more columns and all the rows, then the last five columns for all the rows). Are there any sport-gender combinations that seem relatively easy to classify correctly? Explain briefly.

Solution

Let’s see what happens:

```
tab <- with(d, table(combo, class))
tab
```

```
class
combo BBall_female BBall_male Field_female Field_male Gym_female
BBall_female 3 1 0 0 0
BBall_male 0 9 0 0 0
Field_female 0 0 5 0 0
Field_male 0 1 0 7 0
Gym_female 0 0 0 0 4
Netball_female 0 0 1 0 0
Row_female 0 0 0 0 1
Row_male 0 2 0 1 0
Swim_female 0 0 0 0 0
Swim_male 0 4 0 0 0
T400m_female 0 0 0 0 0
T400m_male 3 1 0 0 0
Tennis_female 0 0 1 0 1
Tennis_male 1 0 0 0 0
TSprnt_female 0 0 0 0 0
TSprnt_male 0 0 0 0 0
WPolo_male 1 3 0 2 0
class
combo Netball_female Row_female Row_male Swim_female Swim_male
BBall_female 5 1 0 0 0
BBall_male 0 0 0 0 2
Field_female 1 0 0 0 0
Field_male 0 2 0 0 0
Gym_female 0 0 0 0 0
Netball_female 13 4 0 0 0
Row_female 5 10 0 0 1
Row_male 0 0 1 0 0
Swim_female 4 1 0 0 0
Swim_male 2 3 0 0 0
T400m_female 3 0 0 0 0
T400m_male 5 3 0 0 0
Tennis_female 2 0 0 0 0
Tennis_male 0 3 0 0 0
TSprnt_female 1 0 0 0 0
TSprnt_male 6 3 0 0 0
WPolo_male 0 3 1 0 0
class
combo T400m_female T400m_male Tennis_female Tennis_male
BBall_female 0 2 0 0
BBall_male 0 0 0 0
Field_female 1 0 0 0
Field_male 0 0 0 0
Gym_female 0 0 0 0
Netball_female 1 4 0 0
Row_female 0 4 0 0
Row_male 1 0 0 0
Swim_female 3 1 0 0
Swim_male 0 1 0 0
T400m_female 6 2 0 0
T400m_male 1 5 0 0
Tennis_female 2 1 0 0
Tennis_male 0 0 0 0
TSprnt_female 2 1 0 0
TSprnt_male 0 0 0 0
WPolo_male 0 0 0 0
class
combo TSprnt_female TSprnt_male WPolo_male
BBall_female 0 0 1
BBall_male 0 0 1
Field_female 0 0 0
Field_male 0 0 2
Gym_female 0 0 0
Netball_female 0 0 0
Row_female 0 0 1
Row_male 0 0 10
Swim_female 0 0 0
Swim_male 0 0 3
T400m_female 0 0 0
T400m_male 0 0 0
Tennis_female 0 0 0
Tennis_male 0 0 0
TSprnt_female 0 0 0
TSprnt_male 0 0 2
WPolo_male 0 0 7
```

That’s kind of long.

For combos that are easy to classify, you’re looking for a largish number on the diagonal of the table (classified correctly), bearing in mind that you only see about four columns of the table at once, and (much) smaller numbers in the rest of the row and column. I don’t mind which ones you pick out, but see if you can find a few:

- Male basketball players (9 out of 12 classified correctly)

- Male field athletes (7 out of 10 classified correctly)

- Female netball players (13 out of about 23)

- Female rowers (10 out of about 22)

Or you can turn it into a tibble:

`tab %>% as_tibble()`

This makes the `tidyverse` output, with frequencies. You probably want to omit the zero ones:

`tab %>% as_tibble() %>% filter(n > 0)`

This is the same output as below. See there for comments.

The other, perhaps easier, way to tackle this one is the `tidyverse` way, making a “long” table of frequencies. Here is some of it. You’ll be able to click to see more:

`d %>% count(combo, class)`

The zeroes never show up here. The `combo` column is the truth, and the `class` column is the prediction. Again, you can see where the big frequencies are; a lot of the female netball players were gotten right, but there were a lot of them to begin with.

Extra: let’s see if we can work out *proportions* correct. I’ve changed my mind from how I originally wrote this. I still use `count`, but I start with the overall misclassification. Let’s take it in steps:

```
d %>%
  count(combo, class) %>%
  mutate(stat = ifelse(combo == class, "correct", "wrong"))
```

That makes a new column `stat` that contains whether the predicted sport-gender combination was correct or wrong. For an overall misclassification rate we have to count these, but *not* simply counting the number of rows; rather, we need to total up the things in the `n` column:

```
d %>%
  count(combo, class) %>%
  mutate(stat = ifelse(combo == class, "correct", "wrong")) %>%
  count(stat, wt = n)
```

This tells us how many predictions overall were right and how many wrong.

To make those into proportions, another `mutate`, dividing by the total of `n`:

```
d %>%
  count(combo, class) %>%
  mutate(stat = ifelse(combo == class, "correct", "wrong")) %>%
  count(stat, wt = n) %>%
  mutate(proportion = n / sum(n))
```

65% of the sport-gender combinations were misclassified. This is awful, but is a lot better than guessing (we’d then get about 5% of them right and about 95% wrong).
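The guessing baseline here is rough, since the 17 combos are not equally common, but as an order of magnitude:

```r
# if all 17 sport-gender combos were equally likely, a random guess would be
1 / 17      # right about 5.9% of the time
1 - 1 / 17  # and wrong about 94% of the time, versus 65% for the lda
```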

There’s a subtlety here that will make sense when we do the corresponding calculation by sport-gender combination. To do *that*, we put a `group_by(combo)` either before or after we define `stat` (it doesn’t matter which way):

```
d %>%
  count(combo, class) %>%
  group_by(combo) %>%
  mutate(stat = ifelse(combo == class, "correct", "wrong")) %>%
  count(stat, wt = n) %>%
  mutate(proportion = n / sum(n))
```

That last `sum(n)`: what is it summing over? The answer is “within `combo`”, since that is the `group_by`. You see that the two `proportion` values within, say, `BBall_female`, add up to 1.
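A small sketch of how `group_by` changes what `sum(n)` sums over (the data here are invented):

```
library(dplyr)

# Invented counts for two groups
tab <- tribble(
  ~combo, ~stat,     ~n,
  "A",    "correct",  8,
  "A",    "wrong",    2,
  "B",    "correct",  3,
  "B",    "wrong",    9
)
# After group_by(combo), sum(n) is the total within each combo,
# so the proportions add to 1 within each group:
tab %>%
  group_by(combo) %>%
  mutate(proportion = n / sum(n))
# A: 0.8 and 0.2; B: 0.25 and 0.75
```

Without the `group_by`, `sum(n)` would be the grand total 22, and the proportions would add to 1 over the whole table instead.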

We don’t actually see all the answers, because there are too many of them. Let’s try to get the proportion correct and wrong in their own columns. This almost works:

```
d %>%
  count(combo, class) %>%
  group_by(combo) %>%
  mutate(stat = ifelse(combo == class, "correct", "wrong")) %>%
  count(stat, wt = n) %>%
  mutate(proportion = n / sum(n)) %>%
  pivot_wider(names_from = stat, values_from = proportion)
```

This doesn’t work because everything outside of the `pivot_wider` is tested for uniqueness; if it’s unique, it gets its own row. Thus, `BBall_male` and 3 is different from `BBall_male` and 9. But we only want one row for `BBall_male`. I think the easiest way around this is to get rid of `n`, since it has served its purpose:

```
d %>%
  count(combo, class) %>%
  group_by(combo) %>%
  mutate(stat = ifelse(combo == class, "correct", "wrong")) %>%
  count(stat, wt = n) %>%
  mutate(proportion = n / sum(n)) %>%
  select(-n) %>%
  pivot_wider(names_from = stat, values_from = proportion, values_fill = list(proportion = 0))
```

One extra thing: some of the `proportion` values were missing, because there weren’t any misclassified (or maybe correctly-classified!) athletes. The `values_fill` sets any missing values in `proportion` to zero.
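A minimal sketch of `values_fill`, on an invented table where one group is missing a category:

```
library(dplyr)
library(tidyr)

# Group B has no "wrong" row, so pivoting would leave an NA there
props <- tribble(
  ~combo, ~stat,     ~proportion,
  "A",    "correct", 0.8,
  "A",    "wrong",   0.2,
  "B",    "correct", 1.0
)
# values_fill replaces the would-be NA with 0:
props %>%
  pivot_wider(names_from = stat, values_from = proportion,
              values_fill = list(proportion = 0))
# B gets wrong = 0 instead of NA
```

The list form `list(proportion = 0)` says which column the fill value applies to; a bare `values_fill = 0` also works when there is only one value column.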

While we’re about it, let’s arrange in order of misclassification probability:

```
d %>%
  count(combo, class) %>%
  group_by(combo) %>%
  mutate(stat = ifelse(combo == class, "correct", "wrong")) %>%
  count(stat, wt = n) %>%
  mutate(proportion = n / sum(n)) %>%
  select(-n) %>%
  pivot_wider(names_from = stat, values_from = proportion, values_fill = list(proportion = 0)) %>%
  replace_na(list(correct = 0, wrong = 0)) %>%
  arrange(wrong)
```

The most distinctive athletes were the female gymnasts (tiny!), followed by the male basketball players (tall) and the female field athletes (heavy). These were easiest to predict from their height and weight. The ones at the bottom of the list were very confusible since the discriminant analysis guessed them all wrong! So what were the most common *misclassifications*? Let’s go back to this:

`head(d)`

```
d %>%
  count(combo, class) %>%
  mutate(stat = ifelse(combo == class, "correct", "wrong"))
```

We want to express those `n` values as proportions out of their actual sport-gender combo, so we group by `combo` before defining the proportions:

```
d %>%
  count(combo, class) %>%
  mutate(stat = ifelse(combo == class, "correct", "wrong")) %>%
  group_by(combo) %>%
  mutate(proportion = n / sum(n))
```

Only pick out the ones that were gotten wrong, and arrange the remaining proportions in descending order:

```
d %>%
  count(combo, class) %>%
  mutate(stat = ifelse(combo == class, "correct", "wrong")) %>%
  group_by(combo) %>%
  mutate(proportion = n / sum(n)) %>%
  filter(stat == "wrong") %>%
  arrange(desc(proportion))
```

The embarrassment champion is the three male tennis players that were taken to be — female rowers! Most of the other mistakes are more forgivable: the male rowers being taken for male water polo players, for example.

\(\blacksquare\)

Grammatically, I am supposed to write this as “two hundred and forty-four” in words, since I am not supposed to start a sentence with a number. But, I say, deal with it. Or, I suppose, “there are 244 people who work…”.↩︎

If you run into an error like “Argument 2 must have names” here, that means that the second thing, `class`, needs to have a name and doesn’t have one.↩︎

This was why we were doing discriminant analysis in the first place.↩︎

That sounds like the ultimate in evasiveness!↩︎

Every time I do this, I forget to put the `summary` around the outside, so I get the long ugly version of the output rather than the short pretty version.↩︎

For suitable definitions of fun.↩︎

Until much later.↩︎

When you’re *recording* the data, you may find it convenient to use short codes to represent the possibly long factor levels, but in that case you should also use a **codebook** so that you know what the codes represent. When I read the data into R, I would create a factor with named levels, like I did here, if I don’t already have one.↩︎

I discovered that I used *pp* twice, and I want to use the first one again later, so I had to rename this one.↩︎

So we should do it, when assessing how good the classification is.↩︎

I have to learn to come up with better names.↩︎

`str_replace` is from `stringr`, and takes three inputs: a piece of text, the text to look for, and the text to replace it with. The piece of text in this case is one of the columns whose name starts with `posterior`; the dot represents *it*, that is to say “one of the columns whose name starts with `posterior`”.↩︎

Rest assured that I did on the final exam!↩︎

The opposite of `unite` is `separate`, which splits a combined column like my `combo` into separate columns; it too uses underscore as the default separator.↩︎

You used to have to group them, but you don’t any more. Hence my old code has them grouped, but my new code does not.↩︎