# 35Discriminant analysis

Packages for this chapter:

library(ggbiplot)
library(MASS)
library(tidyverse)
library(car)

(Note: ggbiplot loads plyr, which overlaps a lot with dplyr (filter, select etc.). We want the dplyr stuff elsewhere, so we load ggbiplot first, and the things in plyr get hidden, as shown in the Conflicts. This, despite appearances, is what we want.)

## 35.1 Telling whether a banknote is real or counterfeit

* A Swiss bank collected a number of known counterfeit (fake) bills over time, and sampled a number of known genuine bills of the same denomination. Is it possible to tell, from measurements taken from a bill, whether it is genuine or not? We will explore that issue here. The variables measured were:

• length

• right-hand width

• left-hand width

• top margin

• bottom margin

• diagonal

1. Read in the data from link, and check that you have 200 rows and 7 columns altogether.

2. Run a multivariate analysis of variance. What do you conclude? Is it worth running a discriminant analysis? (This is the same procedure as with basic MANOVAs before.)

3. Run a discriminant analysis. Display the output.

4. How many linear discriminants did you get? Is that making sense? Explain briefly.

5. * Using your output from the discriminant analysis, describe how each of the linear discriminants that you got is related to your original variables. (This can, maybe even should, be done crudely: “does each variable feature in each linear discriminant: yes or no?”.)

6. What values of your variable(s) would make LD1 large and positive?

7. * Find the means of each variable for each group (genuine and counterfeit bills). You can get this from your fitted linear discriminant object.

8. Plot your linear discriminant(s), however you like. Bear in mind that there is only one linear discriminant.

9. What kind of score on LD1 do genuine bills typically have? What kind of score do counterfeit bills typically have? What characteristics of a bill, therefore, would you look at to determine if a bill is genuine or counterfeit?

## 35.2 Urine and obesity: what makes a difference?

A study was made of the characteristics of urine of young men. The men were classified into four groups based on their degree of obesity. (The groups are labelled a, b, c, d.) Four variables were measured, x (which you can ignore), pigment creatinine, chloride and chlorine. The data are in link as a .csv file. There are 45 men altogether.

Yes, you may have seen this one before. What you found was something like this, probably also with the Box M test (which has a P-value that is small, but not small enough to be a concern):

my_url <- "http://ritsokiguess.site/datafiles/urine.csv"
urine <- read_csv(my_url)
Rows: 45 Columns: 5
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (1): obesity
dbl (4): x, creatinine, chloride, chlorine

ℹ Use spec() to retrieve the full column specification for this data.
ℹ Specify the column types or set show_col_types = FALSE to quiet this message.
response <- with(urine, cbind(creatinine, chlorine, chloride))
urine.1 <- manova(response ~ obesity, data = urine)
summary(urine.1)
          Df  Pillai approx F num Df den Df  Pr(>F)
obesity    3 0.43144   2.2956      9    123 0.02034 *
Residuals 41
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Our aim is to understand why this result was significant.

1. Read in the data again (copy the code from above) and obtain a discriminant analysis.

2. How many linear discriminants were you expecting? Explain briefly.

3. Why do you think we should pay attention to the first two linear discriminants but not the third? Explain briefly.

4. Plot the first two linear discriminant scores (against each other), with each obesity group being a different colour.

5. * Looking at your plot, discuss how (if at all) the discriminants separate the obesity groups. (Where does each obesity group fall on the plot?)

6. * Obtain a table showing observed and predicted obesity groups. Comment on the accuracy of the predictions.

7. Do your conclusions from (here) and (here) appear to be consistent?

## 35.3 Understanding a MANOVA

One use of discriminant analysis is to understand the results of a MANOVA. This question is a followup to a previous MANOVA that we did, the one with two variables y1 and y2 and three groups a through c. The data were in link.

1. Read the data in again and run the MANOVA that you did before.

2. Run a discriminant analysis “predicting” group from the two response variables. Display the output.

3. * In the output from the discriminant analysis, why are there exactly two linear discriminants LD1 and LD2?

4. * From the output, how would you say that the first linear discriminant LD1 compares in importance to the second one LD2: much more important, more important, equally important, less important, much less important? Explain briefly.

5. Obtain a plot of the discriminant scores.

6. Describe briefly how LD1 and/or LD2 separate the groups. Does your picture confirm the relative importance of LD1 and LD2 that you found back in part (here)? Explain briefly.

7. What makes group a have a low score on LD1? There are two steps that you need to make: consider the means of group a on variables y1 and y2 and how they compare to the other groups, and consider how y1 and y2 play into the score on LD1.

8. Obtain predictions for the group memberships of each observation, and make a table of the actual group memberships against the predicted ones. How many of the observations were wrongly classified?

## 35.4 What distinguishes people who do different jobs?

2441 people work at a certain company. They each have one of three jobs: customer service, mechanic, dispatcher. In the data set, these are labelled 1, 2 and 3 respectively. In addition, they each are rated on scales called outdoor, social and conservative. Do people with different jobs tend to have different scores on these scales, or, to put it another way, if you knew a person’s scores on outdoor, social and conservative, could you say something about what kind of job they were likely to hold? The data are in link.

1. Read in the data and display some of it.

2. Note the types of each of the variables, and create any new variables that you need to.

3. Run a multivariate analysis of variance to convince yourself that there are some differences in scale scores among the jobs.

4. Run a discriminant analysis and display the output.

5. Which is the more important, LD1 or LD2? How much more important? Justify your answer briefly.

6. Describe what values for an individual on the scales will make each of LD1 and LD2 high.

7. The first group of employees, customer service, have the highest mean on social and the lowest mean on both of the other two scales. Would you expect the customer service employees to score high or low on LD1? What about LD2?

8. Plot your discriminant scores (which you will have to obtain first), and see if you were right about the customer service employees in terms of LD1 and LD2. The job names are rather long, and there are a lot of individuals, so it is probably best to plot the scores as coloured circles with a legend saying which colour goes with which job (rather than labelling each individual with the job they have).

9. * Obtain predicted job allocations for each individual (based on their scores on the three scales), and tabulate the true jobs against the predicted jobs. How would you describe the quality of the classification? Is that in line with what the plot would suggest?

10. Consider an employee with these scores: 20 on outdoor, 17 on social and 8 on conservative What job do you think they do, and how certain are you about that? Use predict, first making a data frame out of the values to predict for.

11. Since I am not making you hand this one in, I’m going to keep going. Re-run the analysis to incorporate cross-validation, and make a table of the predicted group memberships. Is it much different from the previous one you had? Why would that be?

## 35.5 Observing children with ADHD

A number of children with ADHD were observed by their mother or their father (only one parent observed each child). Each parent was asked to rate occurrences of behaviours of four different types, labelled q1 through q4 in the data set. Also recorded was the identity of the parent doing the observation for each child: 1 is father, 2 is mother.

Can we tell (without looking at the parent column) which parent is doing the observation? Research suggests that rating the degree of impairment in different categories depends on who is doing the rating: for example, mothers may feel that a child has difficulty sitting still, while fathers, who might do more observing of a child at play, might think of such a child as simply being “active” or “just being a kid”. The data are in link.

1. Read in the data and confirm that you have four ratings and a column labelling the parent who made each observation.

2. Run a suitable discriminant analysis and display the output.

3. Which behaviour item or items seem to be most helpful at distinguishing the parent making the observations? Explain briefly.

4. Obtain the predictions from the lda, and make a suitable plot of the discriminant scores, bearing in mind that you only have one LD. Do you think there will be any misclassifications? Explain briefly.

5. Obtain the predicted group memberships and make a table of actual vs. predicted. Were there any misclassifications? Explain briefly.

6. Re-run the discriminant analysis using cross-validation, and again obtain a table of actual and predicted parents. Is the pattern of misclassification different from before? Hints: (i) Bear in mind that there is no predict step this time, because the cross-validation output includes predictions; (ii) use a different name for the predictions this time because we are going to do a comparison in a moment.

7. Display the original data (that you read in from the data file) side by side with two sets of posterior probabilities: the ones that you obtained with predict before, and the ones from the cross-validated analysis. Comment briefly on whether the two sets of posterior probabilities are similar. Hints: (i) use data.frame rather than cbind, for reasons that I explain elsewhere; (ii) round the posterior probabilities to 3 decimals before you display them. There are only 29 rows, so look at them all. I am going to add the LD1 scores to my output and sort by that, but you don’t need to. (This is for something I am going to add later.)

8. Row 17 of your (original) data frame above, row 5 of the output in the previous part, is the mother that was misclassified as a father. Why is it that the cross-validated posterior probabilities are 1 and 0, while the previous posterior probabilities are a bit less than 1 and a bit more than 0?

9. Find the parents where the cross-validated posterior probability of being a father is “non-trivial”: that is, not close to zero and not close to 1. (You will have to make a judgement about what “close to zero or 1” means for you.) What do these parents have in common, all of them or most of them?

## 35.6 Growing corn

A new type of corn seed has been developed. The people developing it want to know if the type of soil the seed is planted in has an impact on how well the seed performs, and if so, what kind of impact. Three outcome measures were used: the yield of corn produced (from a fixed amount of seed), the amount of water needed, and the amount of herbicide needed. The data are in link. 32 fields were planted with the seed, 8 fields with each soil type.

1. Read in the data and verify that you have 32 observations with the correct variables.

2. Run a multivariate analysis of variance to see whether the type of soil has any effect on any of the variables. What do you conclude from it?

3. Run a discriminant analysis on these data, “predicting” soil type from the three response variables. Display the results.

4. * Which linear discriminants seem to be worth paying attention to? Why did you get three linear discriminants? Explain briefly.

5. Which response variables do the important linear discriminants depend on? Answer this by extracting something from your discriminant analysis output.

6. Obtain predictions for the discriminant analysis. (You don’t need to do anything with them yet.)

7. Plot the first two discriminant scores against each other, coloured by soil type. You’ll have to start by making a data frame containing what you need.

8. On your plot that you just made, explain briefly how LD1 distinguishes at least one of the soil types.

9. On your plot, does LD2 appear to do anything to separate the groups? Is this surprising given your earlier findings? Explain briefly.

10. Make a table of actual and predicted soil group. Which soil type was classified correctly the most often?

## 35.7 Understanding athletes’ height, weight, sport and gender

On a previous assignment, we used MANOVA on the athletes data to demonstrate that there was a significant relationship between the combination of the athletes’ height and weight, with the sport they play and the athlete’s gender. The problem with MANOVA is that it doesn’t give any information about the kind of relationship. To understand that, we need to do discriminant analysis, which is the purpose of this question.

The data can be found at link.

1. Once again, read in and display (some of) the data, bearing in mind that the data values are separated by tabs. (This ought to be a free two marks.)

2. Use unite to make a new column in your data frame which contains the sport-gender combination. Display it. (You might like to display only a few columns so that it is clear that you did the right thing.) Hint: you’ve seen unite in the peanuts example in class.

3. Run a discriminant analysis “predicting” sport-gender combo from height and weight. Display the results. (No comment needed yet.)

4. What kind of height and weight would make an athlete have a large (positive) score on LD1? Explain briefly.

5. Make a guess at the sport-gender combination that has the highest score on LD1. Why did you choose the combination you did?

6. What combination of height and weight would make an athlete have a small* (that is, very negative) score on LD2? Explain briefly.

7. Obtain predictions for the discriminant analysis, and use these to make a plot of LD1 score against LD2 score, with the individual athletes distinguished by what sport they play and gender they are. (You can use colour to distinguish them, or you can use shapes. If you want to go the latter way, there are clues in my solutions to the MANOVA question about these athletes.)

8. Look on your graph for the four athletes with the smallest (most negative) scores on LD2. What do they have in common? Does this make sense, given your answer to part (here)? Explain briefly.

9. Obtain a (very large) square table, or a (very long) table with frequencies, of actual and predicted sport-gender combinations. You will probably have to make the square table very small to fit it on the page. For that, displaying the columns in two or more sets is OK (for example, six columns and all the rows, six more columns and all the rows, then the last five columns for all the rows). Are there any sport-gender combinations that seem relatively easy to classify correctly? Explain briefly.

My solutions follow:

## 35.8 Telling whether a banknote is real or counterfeit

* A Swiss bank collected a number of known counterfeit (fake) bills over time, and sampled a number of known genuine bills of the same denomination. Is it possible to tell, from measurements taken from a bill, whether it is genuine or not? We will explore that issue here. The variables measured were:

• length

• right-hand width

• left-hand width

• top margin

• bottom margin

• diagonal

1. Read in the data from link, and check that you have 200 rows and 7 columns altogether.

Solution

Check the data file first. It’s aligned in columns, thus:

my_url <- "http://ritsokiguess.site/datafiles/swiss1.txt"
swiss <- read_table(my_url)

── Column specification ────────────────────────────────────────────────────────
cols(
length = col_double(),
left = col_double(),
right = col_double(),
bottom = col_double(),
top = col_double(),
diag = col_double(),
status = col_character()
)
swiss

Yep, 200 rows and 7 columns.

$$\blacksquare$$

1. Run a multivariate analysis of variance. What do you conclude? Is it worth running a discriminant analysis? (This is the same procedure as with basic MANOVAs before.)

Solution

Small-m manova will do here:

response <- with(swiss, cbind(length, left, right, bottom, top, diag))
swiss.1 <- manova(response ~ status, data = swiss)
summary(swiss.1)
           Df  Pillai approx F num Df den Df    Pr(>F)
status      1 0.92415   391.92      6    193 < 2.2e-16 ***
Residuals 198
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(BoxM(response, swiss$status))  Box's M Test Chi-Squared Value = 121.8991 , df = 21 and p-value: 3.2e-16  That is a very significant Box’s M test, which means that we shouldn’t trust the MANOVA at all. It is only the fact that the MANOVA is so significant that provides any evidence that the discriminant analysis is worth doing. Extra: you might be wondering whether you had to go to all that trouble to make the response variable. Would this work? response2 <- swiss %>% select(length:diag) swiss.1a <- manova(response2 ~ status, data = swiss) Error in model.frame.default(formula = response2 ~ status, data = swiss, : invalid type (list) for variable 'response2' No, because response2 needs to be an R matrix, and it isn’t: class(response2) [1] "tbl_df" "tbl" "data.frame" The error message was a bit cryptic (nothing unusual there), but a data frame (to R) is a special kind of list, so that R didn’t like response2 being a data frame, which it thought was a list. This, however, works, since it turns the data frame into a matrix: response4 <- swiss %>% select(length:diag) %>% as.matrix() swiss.2a <- manova(response4 ~ status, data = swiss) summary(swiss.2a)  Df Pillai approx F num Df den Df Pr(>F) status 1 0.92415 391.92 6 193 < 2.2e-16 *** Residuals 198 --- Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 Anyway, the conclusion: the status of a bill (genuine or counterfeit) definitely has an influence on some or all of those other variables, since the P-value $$2.2 \times 10^{-16}$$ (or less) is really small. So it is apparently worth running a discriminant analysis to figure out where the differences lie. As a piece of strategy, for creating the response matrix, you can always either use cbind, which creates a matrix directly, or you can use select, which is often easier but creates a data frame, and then turn that into a matrix using as.matrix. As long as you end up with a matrix, it’s all good. $$\blacksquare$$ 1. Run a discriminant analysis. Display the output. Solution Now we forget about all that response stuff. For a discriminant analysis, the grouping variable (or combination of the grouping variables) is the “response”, and the quantitative ones are “explanatory”: swiss.3 <- lda(status ~ length + left + right + bottom + top + diag, data = swiss) swiss.3 Call: lda(status ~ length + left + right + bottom + top + diag, data = swiss) Prior probabilities of groups: counterfeit genuine 0.5 0.5 Group means: length left right bottom top diag counterfeit 214.823 130.300 130.193 10.530 11.133 139.450 genuine 214.969 129.943 129.720 8.305 10.168 141.517 Coefficients of linear discriminants: LD1 length 0.005011113 left 0.832432523 right -0.848993093 bottom -1.117335597 top -1.178884468 diag 1.556520967 $$\blacksquare$$ 1. How many linear discriminants did you get? Is that making sense? Explain briefly. Solution I got one discriminant, which makes sense because there are two groups, and the smaller of 6 (variables, not counting the grouping one) and $$2-1$$ is 1. $$\blacksquare$$ 1. * Using your output from the discriminant analysis, describe how each of the linear discriminants that you got is related to your original variables. (This can, maybe even should, be done crudely: “does each variable feature in each linear discriminant: yes or no?”.) Solution This is the Coefficients of Linear Discriminants. Make a call about whether each of those coefficients is close to zero (small in size compared to the others), or definitely positive or definitely negative. These are judgement calls: either you can say that LD1 depends mainly on diag (treating the other coefficients as “small” or close to zero), or you can say that LD1 depends on everything except length. $$\blacksquare$$ 1. What values of your variable(s) would make LD1 large and positive? Solution Depending on your answer to the previous part: If you said that only diag was important, diag being large would make LD1 large and positive. If you said that everything but length was important, then it’s a bit more complicated: left and diag large, right, bottom and top small (since their coefficients are negative). $$\blacksquare$$ 1. * Find the means of each variable for each group (genuine and counterfeit bills). You can get this from your fitted linear discriminant object. Solution swiss.3$means
             length    left   right bottom    top    diag
counterfeit 214.823 130.300 130.193 10.530 11.133 139.450
genuine     214.969 129.943 129.720  8.305 10.168 141.517

$$\blacksquare$$

1. Plot your linear discriminant(s), however you like. Bear in mind that there is only one linear discriminant.

Solution

With only one linear discriminant, we can plot LD1 scores on the $$y$$-axis and the grouping variable on the $$x$$-axis. How you do that is up to you.

Before we start, though, we need the LD1 scores. This means doing predictions. The discriminant scores are in there. We take the prediction output and make a data frame with all the things in the original data. My current preference (it changes) is to store the predictions, and then cbind them with the original data, thus:

swiss.pred <- predict(swiss.3)
d <- cbind(swiss, swiss.pred)
head(d)

I needed head because cbind makes an old-fashioned data.frame rather than a tibble, so if you display it, you get all of it.

This gives the LD1 scores, predicted groups, and posterior probabilities as well. That saves us having to pick out the other things later. The obvious thing is a boxplot. By examining d above (didn’t you?), you saw that the LD scores were in a column called LD1:

ggplot(d, aes(x = status, y = LD1)) + geom_boxplot()

This shows that positive LD1 scores go (almost without exception) with genuine bills, and negative ones with counterfeit bills. It also shows that there are three outlier bills, two counterfeit ones with unusually high LD1 score, and one genuine one with unusually low LD1 score, at least for a genuine bill.

This goes to show that (the Box M test notwithstanding) the two types of bill really are different in a way that is worth investigating.

Or you could do faceted histograms of LD1 by status:

ggplot(d, aes(x = LD1)) + geom_histogram(bins = 10) + facet_grid(status ~ .)

This shows much the same thing as plot(swiss.3) does (try it).

$$\blacksquare$$

1. What kind of score on LD1 do genuine bills typically have? What kind of score do counterfeit bills typically have? What characteristics of a bill, therefore, would you look at to determine if a bill is genuine or counterfeit?

Solution

The genuine bills almost all have a positive score on LD1, while the counterfeit ones all have a negative one. This means that the genuine bills (depending on your answer to (here)) have a large diag, or they have a large left and diag, and a small right, bottom and top. If you look at your table of means in (here), you’ll see that the genuine bills do indeed have a large diag, or, depending on your earlier answer, a small right, bottom and top, but not actually a small left (the left values are very close for the genuine and counterfeit coins).

Extra: as to that last point, this is easy enough to think about. A boxplot seems a nice way to display it:

ggplot(d, aes(y = left, x = status)) + geom_boxplot()

There is a fair bit of overlap: the median is higher for the counterfeit bills, but the highest value actually belongs to a genuine one.

Compare that to diag:

ggplot(d, aes(y = diag, x = status)) + geom_boxplot()

Here, there is an almost complete separation of the genuine and counterfeit bills, with just one low outlier amongst the genuine bills spoiling the pattern. I didn’t look at the predictions (beyond the discriminant scores), since this question (as set on an assignment a couple of years ago) was already too long, but there is no difficulty in doing so. Everything is in the data frame I called d:

with(d, table(obs = status, pred = class))
             pred
obs           counterfeit genuine
counterfeit         100       0
genuine               1      99

(this labels the rows and columns, which is not necessary but is nice.)

The tidyverse way is to make a data frame out of the actual and predicted statuses, and then count what’s in there:

d %>% count(status, class)

This gives a “long” table, with frequencies for each of the combinations for which anything was observed.

Frequency tables are usually wide, and we can make this one so by pivot-wider-ing pred:

d %>%
count(status, class) %>%
pivot_wider(names_from = class, values_from = n)

One of the genuine bills is incorrectly classified as a counterfeit one (evidently that low outlier on LD1), but every single one of the counterfeit bills is classified correctly. That missing value is actually a frequency that is zero, which you can fix up thus:

d %>%
count(status, class) %>%
pivot_wider(names_from = class, values_from = n, values_fill = 0) 

which turns any missing values into the zeroes they should be in this kind of problem. It would be interesting to see what the posterior probabilities look like for that misclassified bill:

d %>% filter(status != class)

On the basis of the six measured variables, this looks a lot more like a counterfeit bill than a genuine one. Are there any other bills where there is any doubt? One way to find out is to find the maximum of the two posterior probabilities. If this is small, there is some doubt about whether the bill is real or fake. 0.99 seems like a very stringent cutoff, but let’s try it and see:

d %>%
mutate(max.post = pmax(posterior.counterfeit, posterior.genuine)) %>%
filter(max.post < 0.99) %>%
dplyr::select(-c(length:diag))

The only one is the bill that was misclassified: it was actually genuine, but was classified as counterfeit. The posterior probabilities say that it was pretty unlikely to be genuine, but it was the only bill for which there was any noticeable doubt at all.

I had to use pmax rather than max there, because I wanted max.post to contain the larger of the two corresponding entries: that is, the first entry in max.post is the larger of the first entry of counterfeit and the first entry in genuine. If I used max instead, I’d get the largest of all the entries in counterfeit and all the entries in genuine, repeated 200 times. (Try it and see.) pmax stands for “parallel maximum”, that is, for each row separately. This also should work:

d %>%
rowwise() %>%
mutate(max.post = max(posterior.counterfeit, posterior.genuine)) %>%
filter(max.post < 0.99) %>%
select(-c(length:diag))

Because we’re using rowwise, max is applied to the pairs of values of posterior.counterfeit and posterior.genuine, taken one row at a time.

$$\blacksquare$$

## 35.9 Urine and obesity: what makes a difference?

A study was made of the characteristics of urine of young men. The men were classified into four groups based on their degree of obesity. (The groups are labelled a, b, c, d.) Four variables were measured, x (which you can ignore), pigment creatinine, chloride and chlorine. The data are in link as a .csv file. There are 45 men altogether.

Yes, you may have seen this one before. What you found was something like this:

my_url <- "http://ritsokiguess.site/datafiles/urine.csv"
urine <- read_csv(my_url)
Rows: 45 Columns: 5
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (1): obesity
dbl (4): x, creatinine, chloride, chlorine

ℹ Use spec() to retrieve the full column specification for this data.
ℹ Specify the column types or set show_col_types = FALSE to quiet this message.
response <- with(urine, cbind(creatinine, chlorine, chloride))
urine.1 <- manova(response ~ obesity, data = urine)
summary(urine.1)
          Df  Pillai approx F num Df den Df  Pr(>F)
obesity    3 0.43144   2.2956      9    123 0.02034 *
Residuals 41
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(BoxM(response, urine$obesity))  Box's M Test Chi-Squared Value = 30.8322 , df = 18 and p-value: 0.0301  Our aim is to understand why this result was significant. (Remember that the P-value on Box’s M test is not small enough to be worried about.) 1. Read in the data again (copy the code from above) and obtain a discriminant analysis. Solution As above, plus: urine.1 <- lda(obesity ~ creatinine + chlorine + chloride, data = urine) urine.1 Call: lda(obesity ~ creatinine + chlorine + chloride, data = urine) Prior probabilities of groups: a b c d 0.2666667 0.3111111 0.2444444 0.1777778 Group means: creatinine chlorine chloride a 15.89167 5.275000 6.012500 b 17.82143 7.450000 5.214286 c 16.34545 8.272727 5.372727 d 11.91250 9.675000 3.981250 Coefficients of linear discriminants: LD1 LD2 LD3 creatinine 0.24429462 -0.1700525 -0.02623962 chlorine -0.02167823 -0.1353051 0.11524045 chloride 0.23805588 0.3590364 0.30564592 Proportion of trace: LD1 LD2 LD3 0.7476 0.2430 0.0093  $$\blacksquare$$ 1. How many linear discriminants were you expecting? Explain briefly. Solution There are 3 variables and 4 groups, so the smaller of 3 and $$4-1=3$$: that is, 3. $$\blacksquare$$ 1. Why do you think we should pay attention to the first two linear discriminants but not the third? Explain briefly. Solution The first two “proportion of trace” values are a lot bigger than the third (or, the third one is close to 0). $$\blacksquare$$ 1. Plot the first two linear discriminant scores (against each other), with each obesity group being a different colour. Solution First obtain the predictions, and then make a data frame out of the original data and the predictions. urine.pred <- predict(urine.1) d <- cbind(urine, urine.pred) head(d) urine produced the first five columns and urine.pred produced the rest. To go a more tidyverse way, we can combine the original data frame and the predictions using bind_cols, but we have to be more careful that the things we are gluing together are both data frames: class(urine) [1] "spec_tbl_df" "tbl_df" "tbl" "data.frame"  class(urine.pred) [1] "list" urine is a tibble all right, but urine.pred is a list. What does it look like? glimpse(urine.pred) List of 3$ class    : Factor w/ 4 levels "a","b","c","d": 2 1 2 2 1 1 3 3 1 1 ...
$posterior: num [1:45, 1:4] 0.233 0.36 0.227 0.294 0.477 ... ..- attr(*, "dimnames")=List of 2 .. ..$ : chr [1:45] "1" "2" "3" "4" ...
.. ..$: chr [1:4] "a" "b" "c" "d"$ x        : num [1:45, 1:3] 0.393 -0.482 0.975 2.188 2.018 ...
..- attr(*, "dimnames")=List of 2
.. ..$: chr [1:45] "1" "2" "3" "4" ... .. ..$ : chr [1:3] "LD1" "LD2" "LD3"

A data frame is a list for which all the items are the same length, but some of the things in here are matrices. You can tell because they have a number of rows, 45, and a number of columns, 3 or 4. They do have the right number of rows, though, so something like as.data.frame (a base R function) will smoosh them all into one data frame, grabbing the columns from the matrices:

head(as.data.frame(urine.pred))

You see that the columns that came from matrices have gained two-part names, the first part from the name of the matrix, the second part from the column name within that matrix. Then we can do this:

dd <- bind_cols(urine, as.data.frame(urine.pred))
dd

If you want to avoid base R altogether, though, and go straight to bind_cols, you have to be more careful about the types of things. bind_cols only works with vectors and data frames, not matrices, so that is what it is up to you to make sure you have. That means pulling out the pieces, turning them from matrices into data frames, and then gluing everything back together:

post <- as_tibble(urine.pred$posterior) ld <- as_tibble(urine.pred$x)
ddd <- bind_cols(urine, class = urine.pred$class, ld, post) ddd That’s a lot of work, but you might say that it’s worth it because you are now absolutely sure what kind of thing everything is. I also had to be slightly careful with the vector of class values; in ddd it has to have a name, so I have to make sure I give it one.2 Any of these ways (in general) is good. The last way is a more careful approach, since you are making sure things are of the right type rather than relying on R to convert them for you, but I don’t mind which way you go. Now make the plot, making sure that you are using columns with the right names. I’m using my first data frame, with the two-part names: ggplot(d, aes(x = x.LD1, y = x.LD2, colour = obesity)) + geom_point() $$\blacksquare$$ 1. * Looking at your plot, discuss how (if at all) the discriminants separate the obesity groups. (Where does each obesity group fall on the plot?) Solution My immediate reaction was “they don’t much”. If you look a bit more closely, the b group, in green, is on the right (high LD1) and the d group (purple) is on the left (low LD1). The a group, red, is mostly at the top (high LD2) but the c group, blue, really is all over the place. The way to tackle interpreting a plot like this is to look for each group individually and see if that group is only or mainly found on a certain part of the plot. This can be rationalized by looking at the “coefficients of linear discriminants” on the output. LD1 is low if creatinine and chloride are low (it has nothing much to do with chlorine since that coefficient is near zero). Group d is lowest on both creatinine and chloride, so that will be lowest on LD1. LD2 is high if chloride is high, or creatinine and chlorine are low. Out of the groups a, b, c, a has the highest mean on chloride and lowest means on the other two variables, so this should be highest on LD2 and (usually) is. Looking at the means is only part of the story; if the individuals within a group are very variable, as they are here (especially group c), then that group will appear all over the plot. The table of means only says how the average individual within a group stacks up. ggbiplot(urine.1, groups = urine$obesity)

This shows (in a way that is perhaps easier to see) how the linear discriminants are related to the original variables, and thus how the groups differ in terms of the original variables.3 Most of the B’s are high creatinine and high chloride (on the right); most of the D’s are low on both (on the left). LD2 has a bit of chloride, but not much of anything else. Extra: the way we used to do this was with “base graphics”, which involved plotting the lda output itself:

plot(urine.1)

which is a plot of each discriminant score against each other one. You can plot just the first two, like this:

plot(urine.1, dimen = 2)

This is easier than using ggplot, but (i) less flexible and (ii) you have to figure out how it works rather than doing things the standard ggplot way. So I went with constructing a data frame from the predictions, and then ggplotting that. It’s a matter of taste which way is better.

$$\blacksquare$$

1. * Obtain a table showing observed and predicted obesity groups. Comment on the accuracy of the predictions.

Solution

Make a table, one way or another:

tab <- with(d, table(obesity, class))
tab
       class
obesity a b c d
a 7 3 2 0
b 2 9 2 1
c 3 4 1 3
d 2 0 1 5

class is always the predicted group in these. You can also name things in table. Or, if you prefer (equally good), the tidyverse way of counting all the combinations of true obesity and predicted class, which can be done all in one go, or in two steps by saving the data frame first. I’m saving my results for later:

d %>% count(obesity, class) -> tab
tab

or if you prefer to make it look more like a table of frequencies:

tab %>% pivot_wider(names_from=class, values_from=n, values_fill = list(n=0))

The thing on the end fills in zero frequencies as such (they would otherwise be NA, which they are not: we know they are zero). My immediate reaction to this is “it’s terrible”! But at least some of the men have their obesity group correctly predicted: 7 of the $$7+3+2+0=12$$ men that are actually in group a are predicted to be in a; 9 of the 14 actual b’s are predicted to be b’s; 5 of the 8 actual d’s are predicted to be d’s. These are not so awful. But only 1 of the 11 c’s is correctly predicted to be a c!

As for what I want to see: I am looking for some kind of statement about how good you think the predictions are (the word “terrible” is fine for this) with some kind of support for your statement. For example, “the predictions are not that good, but at least group B is predicted with some accuracy (9 out of 14).”

I think looking at how well the individual groups were predicted is the most incisive way of getting at this, because the c men are the hardest to get right and the others are easier, but you could also think about an overall misclassification rate. This comes most easily from the “tidy” table:

tab %>% count(correct = (obesity == class), wt = n)

You can count anything, not just columns that already exist. This one is a kind of combined mutate-and-count to create the (logical) column called correct.

It’s a shortcut for this:

tab %>%
mutate(is_correct = (obesity == class)) %>%
count(is_correct, wt = n)

If I don’t put the wt, count counts the number of rows for which the true and predicted obesity group is the same. But that’s not what I want here: I want the number of observations totalled up, which is what the wt= does. It says “use the things in the given column as weights”, which means to total them up rather than count up the number of rows.

This says that 22 men were classified correctly and 23 were gotten wrong. We can find the proportions correct and wrong:

tab %>%
count(correct = (obesity == class), wt = n) %>%
mutate(proportion = n / sum(n))

and we see that 51% of men had their obesity group predicted wrongly. This is the overall misclassification rate, which is a simple summary of how good a job the discriminant analysis did.

There is a subtlety here. n has changed its meaning in the middle of this calculation! In tab, n is counting the number of obesity observed and predicted combinations, but now it is counting the number of men classified correctly and incorrectly. The wt=n uses the first n, but the mutate line uses the new n, the result of the count line here. (I think count used to use nn for the result of the second count, so that you could tell them apart, but it no longer seems to do so.)

I said above that the obesity groups were not equally easy to predict. A small modification of the above will get the misclassification rates by (true) obesity group. This is done by putting an appropriate group_by in at the front, before we do any summarizing:

tab %>%
group_by(obesity) %>%
count(correct = (obesity == class), wt = n) %>%
mutate(proportion = n / sum(n))

This gives the proportion wrong and correct for each (true) obesity group. I’m going to do the one more cosmetic thing to make it easier to read, a kind of “untidying”:

tab %>%
group_by(obesity) %>%
count(correct = (obesity == class), wt = n) %>%
mutate(proportion = n / sum(n)) %>%
select(-n) %>%
pivot_wider(names_from=correct, values_from=proportion)

Looking down the TRUE column, groups A, B and D were gotten about 60% correct (and 40% wrong), but group C is much worse. The overall misclassification rate is made bigger by the fact that C is so hard to predict.

Find out for yourself what happens if I fail to remove the n column before doing the pivot_wider.

A slightly more elegant look is obtained this way, by making nicer values than TRUE and FALSE:

tab %>%
group_by(obesity) %>%
mutate(prediction_stat = ifelse(obesity == class, "correct", "wrong")) %>%
count(prediction_stat, wt = n) %>%
mutate(proportion = n / sum(n)) %>%
select(-n) %>%
pivot_wider(names_from=prediction_stat, values_from=proportion)

$$\blacksquare$$

1. Do your conclusions from (here) and (here) appear to be consistent?

Solution

On the plot of (here), we said that there was a lot of scatter, but that groups a, b and d tended to be found at the top, right and left respectively of the plot. That suggests that these three groups should be somewhat predictable. The c’s, on the other hand, were all over the place on the plot, and were mostly predicted wrong.

The idea is that the stories you pull from the plot and the predictions should be more or less consistent. There are several ways you might say that: another approach is to say that the observations are all over the place on the plot, and the predictions are all bad. This is not as insightful as my comments above, but if that’s what the plot told you, that’s what the predictions would seem to be saying as well. (Or even, the predictions are not so bad compared to the apparently random pattern on the plot, if that’s what you saw. There are different ways to say something more or less sensible.)

$$\blacksquare$$

## 35.10 Understanding a MANOVA

One use of discriminant analysis is to understand the results of a MANOVA. This question is a followup to a previous MANOVA that we did, the one with two variables y1 and y2 and three groups a through c. The data were in link.

1. Read the data in again and run the MANOVA that you did before.

Solution

This is an exact repeat of what you did before:

my_url <- "http://ritsokiguess.site/datafiles/simple-manova.txt"
# my_url <- "https://raw.githubusercontent.com/nxskok/datafiles/master/simple-manova.txt"
simple <- read_delim(my_url, " ")
Rows: 12 Columns: 3
── Column specification ────────────────────────────────────────────────────────
Delimiter: " "
chr (1): group
dbl (2): y1, y2

ℹ Use spec() to retrieve the full column specification for this data.
ℹ Specify the column types or set show_col_types = FALSE to quiet this message.
simple
response <- with(simple, cbind(y1, y2))
simple.3 <- manova(response ~ group, data = simple)
summary(simple.3)
          Df Pillai approx F num Df den Df    Pr(>F)
group      2 1.3534   9.4196      4     18 0.0002735 ***
Residuals  9
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

This P-value is small, so there is some way in which some of the groups differ on some of the variables.4

We should check that we believe this, using Box’s M test:5

summary(BoxM(response, simple$group))  Box's M Test Chi-Squared Value = 3.517357 , df = 6 and p-value: 0.742  There is no problem here: no evidence that any of the response variables differ in spread across the groups. $$\blacksquare$$ 1. Run a discriminant analysis “predicting” group from the two response variables. Display the output. Solution This: simple.4 <- lda(group ~ y1 + y2, data = simple) simple.4 Call: lda(group ~ y1 + y2, data = simple) Prior probabilities of groups: a b c 0.3333333 0.2500000 0.4166667 Group means: y1 y2 a 3.000000 4.0 b 4.666667 7.0 c 8.200000 6.4 Coefficients of linear discriminants: LD1 LD2 y1 0.7193766 0.4060972 y2 0.3611104 -0.9319337 Proportion of trace: LD1 LD2 0.8331 0.1669  Note that this is the other way around from MANOVA: here, we are “predicting the group” from the response variables, in the same manner as one of the flavours of logistic regression: “what makes the groups different, in terms of those response variables?”. $$\blacksquare$$ 1. * In the output from the discriminant analysis, why are there exactly two linear discriminants LD1 and LD2? Solution There are two linear discriminants because there are 3 groups and two variables, so there are the smaller of $$3-1$$ and 2 discriminants. $$\blacksquare$$ 1. * From the output, how would you say that the first linear discriminant LD1 compares in importance to the second one LD2: much more important, more important, equally important, less important, much less important? Explain briefly. Solution Look at the Proportion of trace at the bottom of the output. The first number is much bigger than the second, so the first linear discriminant is much more important than the second. (I care about your reason; you can say it’s “more important” rather than “much more important” and I’m good with that.) $$\blacksquare$$ 1. Obtain a plot of the discriminant scores. Solution This was the old-fashioned way: plot(simple.4) It needs cajoling to produce colours, but we can do better. The first thing is to obtain the predictions: simple.pred <- predict(simple.4) Then we make a data frame out of the discriminant scores and the true groups, using cbind: d <- cbind(simple, simple.pred) head(d) or like this, for fun:6 ld <- as_tibble(simple.pred$x)
post <- as_tibble(simple.pred$posterior) dd <- bind_cols(simple, class = simple.pred$class, ld, post)
dd

After that, we plot the first one against the second one, colouring by true groups:

ggplot(d, aes(x = x.LD1, y = x.LD2, colour = group)) + geom_point()

I wanted to compare this plot with the original plot of y1 vs. y2, coloured by groups:

ggplot(simple, aes(x = y1, y = y2, colour = group)) + geom_point()

The difference between this plot and the one of LD1 vs.
LD2 is that things have been rotated a bit so that most of the separation of groups is done by LD1. This is reflected in the fact that LD1 is quite a bit more important than LD2: the latter doesn’t help much in separating the groups.

With that in mind, we could also plot just LD1, presumably against groups via boxplot:

ggplot(d, aes(x = group, y = x.LD1)) + geom_boxplot()

This shows that LD1 does a pretty fine job of separating the groups, and LD2 doesn’t really have much to add to the picture.

$$\blacksquare$$

1. Describe briefly how LD1 and/or LD2 separate the groups. Does your picture confirm the relative importance of LD1 and LD2 that you found back in part (here)? Explain briefly.

Solution

LD1 separates the groups left to right: group a is low on LD1, b is in the middle and c is high on LD1. (There is no intermingling of the groups on LD1, so it separates the groups perfectly.)

As for LD2, all it does (possibly) is to distinguish b (low) from a and c (high). Or you can, just as reasonably, take the view that it doesn’t really separate any of the groups.

Back in part (here), you said (I hope) that LD1 was (very) important compared to LD2. This shows up here in that LD1 does a very good job of distinguishing the groups, while LD2 does a poor to non-existent job of separating any groups. (If you didn’t say that before, here is an invitation to reconsider what you did say there.)

$$\blacksquare$$

1. What makes group a have a low score on LD1? There are two steps that you need to make: consider the means of group a on variables y1 and y2 and how they compare to the other groups, and consider how y1 and y2 play into the score on LD1.

Solution

The information you need is in the big output.

The means of y1 and y2 for group a are 3 and 4 respectively, which are the lowest of all the groups. That’s the first thing.

The second thing is the coefficients of LD1 in terms of y1 and y2, which are both positive. That means, for any observation, if its y1 and y2 values are large, that observation’s score on LD1 will be large as well. Conversely, if its values are small, as the ones in group a are, its score on LD1 will be small.

You need these two things.

This explains why the group a observations are on the left of the plot. It also explains why the group c observations are on the right: they are large on both y1 and y2, and so large on LD1.

What about LD2? This is a little more confusing (and thus I didn’t ask you about that). Its “coefficients of linear discriminant” are positive on y1 and negative on y2, with the latter being bigger in size. Group b is about average on y1 and distinctly high on y2; the second of these coupled with the negative coefficient on y2 means that the LD2 score for observations in group b will be negative.

For LD2, group a has a low mean on both variables and group c has a high mean, so for both groups there is a kind of cancelling-out happening, and neither group a nor group c will be especially remarkable on LD2.

$$\blacksquare$$

1. Obtain predictions for the group memberships of each observation, and make a table of the actual group memberships against the predicted ones. How many of the observations were wrongly classified?

Solution

Use the simple.pred that you got earlier. This is the table way:

with(d, table(obs = group, pred = class))
   pred
obs a b c
a 4 0 0
b 0 3 0
c 0 0 5

Every single one of the 12 observations has been classified into its correct group. (There is nothing off the diagonal of this table.) The alternative to table is the tidyverse way:

d %>% count(group, class)

or

d %>%
count(group, class) %>%
pivot_wider(names_from=class, values_from=n, values_fill = list(n=0))

if you want something that looks like a frequency table. All the as got classified as a, and so on. That’s the end of what I asked you to do, but as ever I wanted to press on. The next question to ask after getting the predicted groups is “what are the posterior probabilities of being in each group for each observation”: that is, not just which group do I think it belongs in, but how sure am I about that call? The posterior probabilities in my d start with posterior. These have a ton of decimal places which I like to round off first before I display them, eg. to 3 decimals here:

d %>%
select(y1, y2, group, class, starts_with("posterior")) %>%
mutate(across(starts_with("posterior"), $$post) round(post, 3))) You see that the posterior probability of an observation being in the group it actually was in is close to 1 all the way down. The only one with any doubt at all is observation #6, which is actually in group b, but has “only” probability 0.814 of being a b based on its y1 and y2 values. What else could it be? Well, it’s about equally split between being a and c. Let me see if I can display this observation on the plot in a different way. First I need to make a new column picking out observation 6, and then I use this new variable as the shape of the point I plot: simple %>% mutate(is6 = (row_number() == 6)) %>% ggplot(aes(x = y1, y = y2, colour = group, shape = is6)) + geom_point(size = 3) That makes it stand out a bit: if you look carefully, one of the green points (observation 6) is plotted as a triangle rather than a circle, as the legend for is6 indicates. (I plotted all the points bigger to make this easier to see.) Since observation #6 is in group b, it appears as a green triangle. What makes it least like a b? Well, it has the smallest y2 value of any of the b’s (which makes it most like an a of any of the b’s), and it has the largest y1 value (which makes it most like a c of any of the b’s). But still, it’s nearer the greens than anything else, so it’s still more like a b than it is like any of the other groups. \(\blacksquare$$

## 35.11 What distinguishes people who do different jobs?

2447 people work at a certain company. They each have one of three jobs: customer service, mechanic, dispatcher. In the data set, these are labelled 1, 2 and 3 respectively. In addition, they each are rated on scales called outdoor, social and conservative. Do people with different jobs tend to have different scores on these scales, or, to put it another way, if you knew a person’s scores on outdoor, social and conservative, could you say something about what kind of job they were likely to hold? The data are in link.

1. Read in the data and display some of it.

Solution

The usual. This one is aligned columns. I’m using a “temporary” name for my read-in data frame, since I’m going to create the proper one in a moment.

my_url <- "https://raw.githubusercontent.com/nxskok/datafiles/master/jobs.txt"
jobs0 <- read_table(my_url)
Warning: Missing column names filled in: 'X6' [6]

── Column specification ────────────────────────────────────────────────────────
cols(
outdoor = col_double(),
social = col_double(),
conservative = col_double(),
job = col_double(),
id = col_double(),
X6 = col_character()
)
Warning: 244 parsing failures.
row col  expected    actual                                                                 file
1  -- 6 columns 5 columns 'https://raw.githubusercontent.com/nxskok/datafiles/master/jobs.txt'
2  -- 6 columns 5 columns 'https://raw.githubusercontent.com/nxskok/datafiles/master/jobs.txt'
3  -- 6 columns 5 columns 'https://raw.githubusercontent.com/nxskok/datafiles/master/jobs.txt'
4  -- 6 columns 5 columns 'https://raw.githubusercontent.com/nxskok/datafiles/master/jobs.txt'
5  -- 6 columns 5 columns 'https://raw.githubusercontent.com/nxskok/datafiles/master/jobs.txt'
... ... ......... ......... ....................................................................
See problems(...) for more details.
jobs0

We got all that was promised, plus a label id for each employee, which we will from here on ignore.8

$$\blacksquare$$

1. Note the types of each of the variables, and create any new variables that you need to.

Solution

These are all int or whole numbers. But, the job ought to be a factor: the labels 1, 2 and 3 have no meaning as such, they just label the three different jobs. (I gave you a hint of this above.) So we need to turn job into a factor. I think the best way to do that is via mutate, and then we save the new data frame into one called jobs that we actually use for the analysis below:

job_labels <- c("custserv", "mechanic", "dispatcher")
jobs0 %>%
mutate(job = factor(job, labels = job_labels)) -> jobs

I lived on the edge and saved my factor job into a variable with the same name as the numeric one. I should check that I now have the right thing:

jobs

I like this better because you see the actual factor levels rather than the underlying numeric values by which they are stored.

All is good here. If you forget the labels thing, you’ll get a factor, but its levels will be 1, 2, and 3, and you will have to remember which jobs they go with. I’m a fan of giving factors named levels, so that you can remember what stands for what.9

Extra: another way of doing this is to make a lookup table, that is, a little table that shows which job goes with which number:

lookup_tab <- tribble(
~job, ~jobname,
1, "custserv",
2, "mechanic",
3, "dispatcher"
)
lookup_tab

I carefully put the numbers in a column called job because I want to match these with the column called job in jobs0:

jobs0 %>%
left_join(lookup_tab) %>%
sample_n(20)
Joining with by = join_by(job)

You see that each row has the name of the job that employee has, in the column jobname, because the job id was looked up in our lookup table. (I displayed some random rows so you could see that it worked.)

$$\blacksquare$$

1. Run a multivariate analysis of variance to convince yourself that there are some differences in scale scores among the jobs.

Solution

You know how to do this, right? This one is the easy way:

response <- with(jobs, cbind(social, outdoor, conservative))
response.1 <- manova(response ~ job, data = jobs)
summary(response.1)
           Df  Pillai approx F num Df den Df    Pr(>F)
job         2 0.76207   49.248      6    480 < 2.2e-16 ***
Residuals 241
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Or you can use Manova. That is mostly for practice here, since there is no reason to make things difficult for yourself:

library(car)
response.2 <- lm(response ~ job, data = jobs)
summary(Manova(response.2))

Type II MANOVA Tests:

Sum of squares and products for error:
social    outdoor conservative
social       4406.2994  334.90273    235.46301
outdoor       334.9027 4082.46302     64.76334
conservative  235.4630   64.76334   2683.25695

------------------------------------------

Term: job

Sum of squares and products for the hypothesis:
social   outdoor conservative
social        2889.1228 -794.3945   -1405.8401
outdoor       -794.3945 1609.7993     283.1711
conservative -1405.8401  283.1711     691.7594

Multivariate Tests: job
Df test stat approx F num Df den Df     Pr(>F)
Pillai            2 0.7620657 49.24757      6    480 < 2.22e-16 ***
Wilks             2 0.3639880 52.38173      6    478 < 2.22e-16 ***
Hotelling-Lawley  2 1.4010307 55.57422      6    476 < 2.22e-16 ***
Roy               2 1.0805270 86.44216      3    240 < 2.22e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

This version gives the four different versions of the test (rather than just the Pillai test that manova gives), but the results are in this case identical for all of them.

So: oh yes, there are differences (on some or all of the variables, for some or all of the groups). So we need something like discriminant analysis to understand the differences.

We really ought to follow this up with Box’s M test, to be sure that the variances and correlations for each variable are equal enough across the groups, but we note off the top that the P-values (all of them) are really small, so there ought not to be much doubt about the conclusion anyway:

summary(BoxM(response, jobs$job))  Box's M Test Chi-Squared Value = 25.64176 , df = 12 and p-value: 0.0121  This is small, but not (for this test) small enough to worry about (it’s not less than 0.001). This, and the lda below, actually works perfectly well if you use the original (integer) job, but then you have to remember which job number is which. $$\blacksquare$$ 1. Run a discriminant analysis and display the output. Solution Now job is the “response”: job.1 <- lda(job ~ social + outdoor + conservative, data = jobs) job.1 Call: lda(job ~ social + outdoor + conservative, data = jobs) Prior probabilities of groups: custserv mechanic dispatcher 0.3483607 0.3811475 0.2704918 Group means: social outdoor conservative custserv 24.22353 12.51765 9.023529 mechanic 21.13978 18.53763 10.139785 dispatcher 15.45455 15.57576 13.242424 Coefficients of linear discriminants: LD1 LD2 social -0.19427415 -0.04978105 outdoor 0.09198065 -0.22501431 conservative 0.15499199 0.08734288 Proportion of trace: LD1 LD2 0.7712 0.2288  $$\blacksquare$$ 1. Which is the more important, LD1 or LD2? How much more important? Justify your answer briefly. Solution Look at the “proportion of trace” at the bottom. The value for LD1 is quite a bit higher, so LD1 is quite a bit more important when it comes to separating the groups. LD2 is, as I said, less important, but is not completely worthless, so it will be worth taking a look at it. $$\blacksquare$$ 1. Describe what values for an individual on the scales will make each of LD1 and LD2 high. Solution This is a two-parter: decide whether each scale makes a positive, negative or zero contribution to the linear discriminant (looking at the “coefficients of linear discriminants”), and then translate that into what would make each LD high. Let’s start with LD1: Its coefficients on the three scales are respectively negative ($$-0.19$$), zero (0.09; my call) and positive (0.15). Where you draw the line is up to you: if you want to say that outdoor’s contribution is positive, go ahead. This means that LD1 will be high if social is low and if conservative is high. (If you thought that outdoor’s coefficient was positive rather than zero, if outdoor is high as well.) Now for LD2: I’m going to call outdoor’s coefficient of $$-0.22$$ negative and the other two zero, so that LD2 is high if outdoor is low. Again, if you made a different judgement call, adapt your answer accordingly. $$\blacksquare$$ 1. The first group of employees, customer service, have the highest mean on social and the lowest mean on both of the other two scales. Would you expect the customer service employees to score high or low on LD1? What about LD2? Solution In the light of what we said in the previous part, the customer service employees, who are high on social and low on conservative, should be low (negative) on LD1, since both of these means are pointing that way. As I called it, the only thing that matters to LD2 is outdoor, which is low for the customer service employees, and thus LD2 for them will be high (negative coefficient). $$\blacksquare$$ 1. Plot your discriminant scores (which you will have to obtain first), and see if you were right about the customer service employees in terms of LD1 and LD2. The job names are rather long, and there are a lot of individuals, so it is probably best to plot the scores as coloured circles with a legend saying which colour goes with which job (rather than labelling each individual with the job they have). Solution Predictions first, then make a data frame combining the predictions with the original data: p <- predict(job.1) d <- cbind(jobs, p) d Following my suggestion, plot these the standard way with colour distinguishing the jobs: ggplot(d, aes(x = x.LD1, y = x.LD2, colour = job)) + geom_point() ggplot(d, aes(x = x.LD1, y = x.LD2, colour = class)) + geom_point() # ggplot(d, aes(x = job, y = x.LD1)) + geom_boxplot() I was mostly right about the customer service people: small LD1 definitely, large LD2 kinda. I wasn’t more right because the group means don’t tell the whole story: evidently, the customer service people vary quite a bit on outdoor, so the red dots are all over the left side of the plot. There is quite a bit of intermingling of the three employee groups on the plot, but the point of the MANOVA is that the groups are (way) more separated than you’d expect by chance, that is if the employees were just randomly scattered across the plot. To think back to that trace thing: here, it seems that LD1 mainly separates customer service (left) from dispatchers (right); the mechanics are all over the place on LD1, but they tend to be low on LD2. So LD2 does have something to say. $$\blacksquare$$ 1. * Obtain predicted job allocations for each individual (based on their scores on the three scales), and tabulate the true jobs against the predicted jobs. How would you describe the quality of the classification? Is that in line with what the plot would suggest? Solution Use the predictions that you got before and saved in d: with(d, table(obs = job, pred = class))  pred obs custserv mechanic dispatcher custserv 68 13 4 mechanic 16 67 10 dispatcher 3 13 50 Or, the tidyverse way: d %>% count(job, class) or: d %>% count(job, class) %>% pivot_wider(names_from=class, values_from=n, values_fill = list(n=0)) I didn’t really need the values_fill since there are no missing frequencies, but I’ve gotten used to putting it in. There are a lot of misclassifications, but there are a lot of people, so a large fraction of people actually got classified correctly. The biggest frequencies are of people who got classified correctly. I think this is about what I was expecting, looking at the plot: the people top left are obviously customer service, the ones top right are in dispatch, and most of the ones at the bottom are mechanics. So there will be some errors, but the majority of people should be gotten right. The easiest pairing to get confused is customer service and mechanics, which you might guess from the plot: those customer service people with a middling LD1 score and a low LD2 score (that is, high on outdoor) could easily be confused with the mechanics. The easiest pairing to distinguish is customer service and dispatchers: on the plot, left and right, that is, low and high respectively on LD1. What fraction of people actually got misclassified? You could just pull out the numbers and add them up, but you know me: I’m too lazy to do that. We can work out the total number and fraction who got misclassified. There are different ways you might do this, but the tidyverse way provides the easiest starting point. For example, we can make a new column that indicates whether a group is the correct or wrong classification: d %>% count(job, class) %>% mutate(job_stat = ifelse(job == class, "correct", "wrong")) From there, we count up the correct and wrong ones, recognizing that we want to total up the frequencies in n, not just count the number of rows: d %>% count(job, class) %>% mutate(job_stat = ifelse(job == class, "correct", "wrong")) %>% count(job_stat, wt = n) and turn these into proportions: d %>% count(job, class) %>% mutate(job_stat = ifelse(job == class, "correct", "wrong")) %>% count(job_stat, wt = n) %>% mutate(proportion = n / sum(n)) There is a count followed by another count of the first lot of counts, so the second count column has taken over the name n. 24% of all the employees got classified into the wrong job, based on their scores on outdoor, social and conservative. This is actually not bad, from one point of view: if you just guessed which job each person did, without looking at their scores on the scales at all, you would get $${1\over 3}=33\%$$ of them right, just by luck, and $${2\over3}=67\%$$ of them wrong. From 67% to 24% error is a big improvement, and that is what the MANOVA is reacting to. To figure out whether some of the groups were harder to classify than others, squeeze a group_by in early to do the counts and proportions for each (true) job: d %>% count(job, class) %>% mutate(job_stat = ifelse(job == class, "correct", "wrong")) %>% group_by(job) %>% count(job_stat, wt = n) %>% mutate(proportion = n / sum(n)) or even split out the correct and wrong ones into their own columns: d %>% count(job, class) %>% mutate(job_stat = ifelse(job == class, "correct", "wrong")) %>% group_by(job) %>% count(job_stat, wt = n) %>% mutate(proportion = n / sum(n)) %>% select(-n) %>% pivot_wider(names_from=job_stat, values_from=proportion) The mechanics were hardest to get right and easiest to get wrong, though there isn’t much in it. I think the reason is that the mechanics were sort of “in the middle” in that a mechanic could be mistaken for either a dispatcher or a customer service representative, but but customer service and dispatchers were more or less distinct from each other. It’s up to you whether you prefer to do this kind of thing by learning enough about table to get it to work, or whether you want to use tidy-data mechanisms to do it in a larger number of smaller steps. I immediately thought of table because I knew about it, but the tidy-data way is more consistent with the way we have been doing things. $$\blacksquare$$ 1. Consider an employee with these scores: 20 on outdoor, 17 on social and 8 on conservative What job do you think they do, and how certain are you about that? Use predict, first making a data frame out of the values to predict for. Solution This is in fact exactly the same idea as the data frame that I generally called new when doing predictions for other models. I think the clearest way to make one of these is with tribble: new <- tribble( ~outdoor, ~social, ~conservative, 20, 17, 8 ) new summary(jobs)  outdoor social conservative job Min. : 0.00 Min. : 7.00 Min. : 0.00 custserv :85 1st Qu.:13.00 1st Qu.:17.00 1st Qu.: 8.00 mechanic :93 Median :16.00 Median :21.00 Median :11.00 dispatcher:66 Mean :15.64 Mean :20.68 Mean :10.59 3rd Qu.:19.00 3rd Qu.:25.00 3rd Qu.:13.00 Max. :28.00 Max. :35.00 Max. :20.00 id X6 Min. : 1.00 Length:244 1st Qu.:21.00 Class :character Median :41.00 Mode :character Mean :41.95 3rd Qu.:61.25 Max. :93.00  There’s no need for datagrid or crossing here because I’m not doing combinations of things. (I might have done that here, to get a sense for example of “what effect does a higher score on outdoor have on the likelihood of a person doing each job?”. But I didn’t.) Then feed this into predict as the second thing: pp1 <- predict(job.1, new) Our predictions are these: cbind(new, pp1) The class thing gives our predicted job, and the posterior probabilities say how sure we are about that. So we reckon there’s a 78% chance that this person is a mechanic; they might be a dispatcher but they are unlikely to be in customer service. Our best guess is that they are a mechanic.10 Does this pass the sanity-check test? First figure out where our new employee stands compared to the others: summary(jobs)  outdoor social conservative job Min. : 0.00 Min. : 7.00 Min. : 0.00 custserv :85 1st Qu.:13.00 1st Qu.:17.00 1st Qu.: 8.00 mechanic :93 Median :16.00 Median :21.00 Median :11.00 dispatcher:66 Mean :15.64 Mean :20.68 Mean :10.59 3rd Qu.:19.00 3rd Qu.:25.00 3rd Qu.:13.00 Max. :28.00 Max. :35.00 Max. :20.00 id X6 Min. : 1.00 Length:244 1st Qu.:21.00 Class :character Median :41.00 Mode :character Mean :41.95 3rd Qu.:61.25 Max. :93.00  Their score on outdoor is above average, but their scores on the other two scales are below average (right on the 1st quartile in each case). Go back to the table of means from the discriminant analysis output. The mechanics have the highest average for outdoor, they’re in the middle on social and they are lowish on conservative. Our new employee is at least somewhat like that. Or, we can figure out where our new employee sits on the plot. The output from predict gives the predicted LD1 and LD2, which are 0.71 and $$-1.02$$ respectively. This employee would sit to the right of and below the middle of the plot: in the greens, but with a few blues nearby: most likely a mechanic, possibly a dispatcher, but likely not customer service, as the posterior probabilities suggest. Extra: I can use the same mechanism to predict for a combination of values. This would allow for the variability of each of the original variables to differ, and enable us to assess the effect of, say, a change in conservative over its “typical range”, which we found out above with summary(jobs). I’ll take the quartiles, in my usual fashion: outdoors <- c(13, 19) socials <- c(17, 25) conservatives <- c(8, 13) The IQRs are not that different, which says that what we get here will not be that different from the coefficients of linear discriminants’’ above: new <- crossing( outdoor = outdoors, social = socials, conservative = conservatives ) pp2 <- predict(job.1, new) px <- round(pp2$x, 2)
cbind(new, pp2$class, px) The highest (most positive) LD1 score goes with high outdoor, low social, high conservative (and being a dispatcher). It is often interesting to look at the second-highest one as well: here that is low outdoor, and the same low social and high conservative as before. That means that outdoor has nothing much to do with LD1 score. Being low social is strongly associated with LD1 being positive, so that’s the important part of LD1. What about LD2? The most positive LD2 are these:  LD2 outdoor social conservative ==================================== 0.99 low low high 0.59 low high high 0.55 low low low  These most consistently go with outdoor being low. Is that consistent with the “coefficients of linear discriminants”? job.1$scaling
                     LD1         LD2
social       -0.19427415 -0.04978105
outdoor       0.09198065 -0.22501431
conservative  0.15499199  0.08734288

Very much so: outdoor has nothing much to do with LD1 and everything to do with LD2.

$$\blacksquare$$

1. Since I am not making you hand this one in, I’m going to keep going. Re-run the analysis to incorporate cross-validation, and make a table of the predicted group memberships. Is it much different from the previous one you had? Why would that be?

Solution

Stick a CV=T in the lda:

job.3 <- lda(job ~ social + outdoor + conservative, data = jobs, CV = T)
glimpse(job.3)
List of 5
$class : Factor w/ 3 levels "custserv","mechanic",..: 1 2 1 1 1 2 1 1 1 1 ...$ posterior: num [1:244, 1:3] 0.902 0.352 0.71 0.805 0.766 ...
..- attr(*, "dimnames")=List of 2
.. ..$: chr [1:244] "1" "2" "3" "4" ... .. ..$ : chr [1:3] "custserv" "mechanic" "dispatcher"
$terms :Classes 'terms', 'formula' language job ~ social + outdoor + conservative .. ..- attr(*, "variables")= language list(job, social, outdoor, conservative) .. ..- attr(*, "factors")= int [1:4, 1:3] 0 1 0 0 0 0 1 0 0 0 ... .. .. ..- attr(*, "dimnames")=List of 2 .. ..- attr(*, "term.labels")= chr [1:3] "social" "outdoor" "conservative" .. ..- attr(*, "order")= int [1:3] 1 1 1 .. ..- attr(*, "intercept")= int 1 .. ..- attr(*, "response")= int 1 .. ..- attr(*, ".Environment")=<environment: R_GlobalEnv> .. ..- attr(*, "predvars")= language list(job, social, outdoor, conservative) .. ..- attr(*, "dataClasses")= Named chr [1:4] "factor" "numeric" "numeric" "numeric" .. .. ..- attr(*, "names")= chr [1:4] "job" "social" "outdoor" "conservative"$ call     : language lda(formula = job ~ social + outdoor + conservative, data = jobs, CV = T)
$xlevels : Named list() This directly contains a class (no need for a predict), so we make a data frame, with a different name since I shortly want to compare this one with the previous one: dcv <- cbind(jobs, class = job.3$class, posterior = job.3$posterior) head(dcv) This is a bit fiddlier than before because job.3 contains some things of different lengths and we can’t just cbind them all together. Then go straight to the table: with(dcv, table(job, class))  class job custserv mechanic dispatcher custserv 67 14 4 mechanic 16 67 10 dispatcher 3 14 49 This is almost exactly the same as we had in part (here): the cross-validation has made almost no difference. The reason for that is that here, we have lots of data (you can predict for one mechanic, say, and there are still lots of others to say that the mechanics are “over there”. This is in sharp contrast to the example in class with the bellydancers, where if you try to predict for one of the extreme ones, the notion of “where are the bellydancers” changes substantially. Here, I suspect that the few people whose predictions changed were ones where the posterior probabilities were almost equal for two jobs, and the cross-validation was just enough to tip the balance. You can check this, but there are a lot of posterior probabilities to look at! This is another way of saying that with small data sets, your conclusions are more “fragile” or less likely to be generalizable. With a larger data set like this one, cross-validation, which is the right thing to do, makes almost no difference.11 All right, I suppose I do want to investigate the individuals whose predicted jobs changed, and look at their posterior probabilities. I think I have the machinery to do that. Let’s start by gluing together the dataframes with the predictions from the regular lda (in d) and the ones from the cross-validation (in dcv). I think I can do that like this: d %>% left_join(dcv, join_by(id, job)) There’s already subtlety. The people are numbered separately within each actual job, so the thing that uniquely identifies each person (what database people call a “key”) is the combination of the actual job they do plus their id within that job. You might also think of using bind_cols, except that this adds a number to all the column names which is a pain to deal with. I don’t really need to look up the people in the second dataframe, since I know where they are (in the corresponding rows), but doing so seems to make everything else easier. The columns with an x on the end of their names came from d, that is, the predictions without cross-validation, and the ones with a y came from cross-validation. Let’s see if we can keep only the columns we need so that it’s a bit less unwieldy: d %>% left_join(dcv, join_by(id, job)) %>% select(outdoor = outdoor.x, social = social.x, conservative = conservative.x, job, id, starts_with("class"), starts_with("posterior")) -> all all That’s not too bad. We could shorten some variable names and reduce some decimal places, but that’ll do for now. How many individuals were predicted differently? We have columns called class.x (predicted group membership from original LDA) and class.y (from cross-validation), and so: all %>% filter(class.x != class.y) There are exactly two individuals that were predicted differently. Under cross-validation, they both got called mechanics. How do their posterior probabilities compare? These are all in columns beginning with posterior. We could scrutinize the output above, or try to make things simpler. Let’s round them to three decimals, and then display only some of the columns: all %>% filter(class.x != class.y) %>% mutate(across(starts_with("posterior"), $$post) round(post, 3))) %>% select(id, job, starts_with("posterior")) And then, because I can, let’s re-format that to make it easier to read, x being regular LDA and y being cross-validation: all %>% filter(class.x != class.y) %>% mutate(across(starts_with("posterior"), \(post) round(post, 3))) %>% select(id, job, starts_with("posterior")) %>% pivot_longer(starts_with("posterior"), names_to = c("post_job", "method"), names_pattern = "posterior\\.(.*)\\.(.)", values_to = "prob") %>% pivot_wider(names_from = method, values_from = prob) As I suspected, the posterior probabilities in each case are almost identical, but different ones happen to be slightly higher in the two cases. For the first individual (actually in customer service), cross-validation just tweaked the posterior probabilities enough to call that individual a mechanic, and for the second one, actually a dispatcher, the first analysis was almost too close to call, and things under cross-validation got nudged onto the mechanic side again. All right, what about those people who got misclassified (say, by the LDA rather than the cross-validation, since it seems not to make much difference)? Let’s count them first: all %>% mutate(is_correct = ifelse(job == class.x, "correct", "wrong")) -> all.mis all.mis %>% count(is_correct == "wrong") %>% mutate(proportion = n / sum(n)) 24% of them. There are a lot of them, so we’ll pick a random sample to look at, rounding the posterior probabilities to 3 decimals first and reducing the number of columns to look at: set.seed(457299) all.mis %>% filter(is_correct == "wrong") %>% mutate(across(starts_with("posterior"), \(post) round(post, 3))) %>% select( id, job, class.x, outdoor, social, conservative, starts_with("posterior") ) %>% sample_n(15) I put the set.seed in so that this will come out the same each time I do it, and so that the discussion below always makes sense. Now we can look at the true and predicted jobs for these people, and the posterior probabilities (which I rounded earlier). • The first one, id 6, is badly wrong; this was actually a mechanic, but the posterior probabilities say that it is a near-certain dispatcher. • The second one, id 65, is a little better, but the posterior probability of actually being a mechanic is only 0.269; the probability of being a dispatcher is much higher at 0.593, so that’s what it gets classified as. • The third one, though, id #61, is a very close call: posterior probability 0.438 of being in customer service (correct), 0.453 of being a dispatcher, only slightly higher, but enough to make the prediction wrong. The implication from looking at our sample of 15 people is that some of them are “badly” misclassified (with a high posterior probability of having a different job from the one they actually hold), but a lot of them came out on the wrong end of a close call. This suggests that a number of the correct classifications came out right almost by chance as well, with (hypothesizing) two close posterior probabilities of which their actual job came out slightly higher. Further further analysis would look at the original variables social, outdoor and conservative for the misclassified people, and try to find out what was unusual about them. But I think now would be an excellent place for me to stop. \(\blacksquare$$ ## 35.12 Observing children with ADHD A number of children with ADHD were observed by their mother or their father (only one parent observed each child). Each parent was asked to rate occurrences of behaviours of four different types, labelled q1 through q4 in the data set. Also recorded was the identity of the parent doing the observation for each child: 1 is father, 2 is mother. Can we tell (without looking at the parent column) which parent is doing the observation? Research suggests that rating the degree of impairment in different categories depends on who is doing the rating: for example, mothers may feel that a child has difficulty sitting still, while fathers, who might do more observing of a child at play, might think of such a child as simply being “active” or “just being a kid”. The data are in link. 1. Read in the data and confirm that you have four ratings and a column labelling the parent who made each observation. Solution As ever: my_url <- "https://raw.githubusercontent.com/nxskok/datafiles/master/adhd-parents.txt" adhd <- read_delim(my_url, " ") Rows: 29 Columns: 5 ── Column specification ──────────────────────────────────────────────────────── Delimiter: " " chr (1): parent dbl (4): q1, q2, q3, q4 ℹ Use spec() to retrieve the full column specification for this data. ℹ Specify the column types or set show_col_types = FALSE to quiet this message. adhd Yes, exactly that. $$\blacksquare$$ 1. Run a suitable discriminant analysis and display the output. Solution This is as before: adhd.1 <- lda(parent ~ q1 + q2 + q3 + q4, data = adhd) adhd.1 Call: lda(parent ~ q1 + q2 + q3 + q4, data = adhd) Prior probabilities of groups: father mother 0.1724138 0.8275862 Group means: q1 q2 q3 q4 father 1.800 1.000000 1.800000 1.800 mother 2.375 2.791667 1.958333 1.625 Coefficients of linear discriminants: LD1 q1 -0.3223454 q2 2.3219448 q3 0.1411360 q4 0.1884613 $$\blacksquare$$ 1. Which behaviour item or items seem to be most helpful at distinguishing the parent making the observations? Explain briefly. Solution Look at the Coefficients of Linear Discriminants. The coefficient of q2, 2.32, is much larger in size than the others, so it’s really q2 that distinguishes mothers and fathers. Note also that the group means for fathers and mothers are fairly close on all the items except for q2, which are a long way apart. So that’s another hint that it might be q2 that makes the difference. But that might be deceiving: one of the other qs, even though the means are close for mothers and fathers, might actually do a good job of distinguishing mothers from fathers, because it has a small SD overall. $$\blacksquare$$ 1. Obtain the predictions from the lda, and make a suitable plot of the discriminant scores, bearing in mind that you only have one LD. Do you think there will be any misclassifications? Explain briefly. Solution The prediction is the obvious thing. I take a quick look at it (using glimpse), but only because I feel like it: adhd.2 <- predict(adhd.1) glimpse(adhd.2) List of 3$ class    : Factor w/ 2 levels "father","mother": 1 2 1 2 2 2 2 2 2 2 ...
$posterior: num [1:29, 1:2] 9.98e-01 5.57e-06 9.98e-01 4.97e-02 4.10e-05 ... ..- attr(*, "dimnames")=List of 2 .. ..$ : chr [1:29] "1" "2" "3" "4" ...
.. ..$: chr [1:2] "father" "mother"$ x        : num [1:29, 1] -3.327 1.357 -3.327 -0.95 0.854 ...
..- attr(*, "dimnames")=List of 2
.. ..$: chr [1:29] "1" "2" "3" "4" ... .. ..$ : chr "LD1"

The discriminant scores are in the thing called x in there. There is only LD1 (only two groups, mothers and fathers), so the right way to plot it is against the true groups, eg. by a boxplot, first making a data frame, using data.frame, containing what you need:

d <- cbind(adhd, adhd.2)
head(d)
ggplot(d, aes(x = parent, y = LD1)) + geom_boxplot()

The fathers look to be a very compact group with LD1 score around $$-3$$, so I don’t foresee any problems there. The mothers, on the other hand, have outliers: there is one with LD1 score beyond $$-3$$ that will certainly be mistaken for a father. There are a couple of other unusual LD1 scores among the mothers, but a rule like “anything above $$-2$$ is called a mother, anything below is called a father” will get these two right. So I expect that the one very low mother will get misclassified, but that’s the only one.

$$\blacksquare$$

1. Obtain the predicted group memberships and make a table of actual vs. predicted. Were there any misclassifications? Explain briefly.

Solution

Use the predictions from the previous part, and the observed parent values from the original data frame. Then use either table or tidyverse to summarize.

with(d, table(obs = parent, pred = class))
        pred
obs      father mother
father      5      0
mother      1     23

Or,

d %>% count(parent, class)

or

d %>%
count(parent, class) %>%
pivot_wider(names_from=class, values_from=n, values_fill = list(n=0))

One of the mothers got classified as a father (evidently that one with a very negative LD1 score), but everything else is correct.

This time, by “explain briefly” I mean something like “tell me how you know there are or are not misclassifications”, or “describe any misclassifications that occur” or something like that.

Extra: I was curious — what is it about that one mother that caused her to get misclassified? (I didn’t ask you to think further about this, but in case you are curious as well.)

First, which mother was it? Let’s begin by adding the predicted classification to the data frame, and then we can query it by asking to see only the rows where the actual parent and the predicted parent were different. I’m also going to create a column id that will give us the row of the original data frame:

d %>%
mutate(id = row_number()) %>%
filter(parent != class)

It was the original row 17. So what was unusual about this? We know from earlier that behaviour q2 was the one that generally distinguished mothers from fathers, so maybe we should find the mean and SD of scores for mothers and fathers on q2:

adhd %>% group_by(parent) %>% summarize(m2 = mean(q2), s2 = sd(q2))

The fathers’ scores on q2 were all 1, but the mothers’ scores on q2 were on average much higher. So it’s not really a surprise that this mother was mistaken for a father.

$$\blacksquare$$

1. Re-run the discriminant analysis using cross-validation, and again obtain a table of actual and predicted parents. Is the pattern of misclassification different from before? Hints: (i) Bear in mind that there is no predict step this time, because the cross-validation output includes predictions; (ii) use a different name for the predictions this time because we are going to do a comparison in a moment.

Solution

So, this, with different name:

adhd.3 <- lda(parent ~ q1 + q2 + q3 + q4, data = adhd, CV = T)
dd <- cbind(adhd, class = adhd.3$class, posterior = adhd.3$posterior)
with(dd, table(parent, class))
        class
parent   father mother
father      5      0
mother      1     23

It’s exactly the same pattern of misclassification. (In fact, it’s exactly the same mother being misclassified as a father.)

This one is the same not because of having lots of data. In fact, as you see below, having a small data set makes quite a bit of difference to the posterior probabilities (where they are not close to 1 or 0), but the decisions about whether the parents are a mother or a father are clear-cut enough that none of those change. Even though (some of) the posterior probabilities are noticeably changed, which one is the bigger has not changed at all.

$$\blacksquare$$

1. Display the original data (that you read in from the data file) side by side with two sets of posterior probabilities: the ones that you obtained with predict before, and the ones from the cross-validated analysis. Comment briefly on whether the two sets of posterior probabilities are similar. Hints: (i) use data.frame rather than cbind, for reasons that I explain elsewhere; (ii) round the posterior probabilities to 3 decimals before you display them. There are only 29 rows, so look at them all. I am going to add the LD1 scores to my output and sort by that, but you don’t need to. (This is for something I am going to add later.)

Solution

We have two data frames, d and dd12
that respectively contain everything from the (original) lda output and the cross-validated output. Let’s glue them together, look at what we have, and then pull out what we need:

all <- data.frame(d, dd)
head(all)

The ones with a 1 on the end are the cross-validated ones. We need the posterior probabilities, rounded, and they need to have shorter names:

all %>%
select(parent, starts_with("posterior"), LD1) %>%
mutate(across(starts_with("posterior"), $$x) round(x, 3))) %>% rename_with( ~ str_replace(., "posterior", "p"), starts_with("posterior"), ) %>% arrange(LD1) The rename_with changes the names of the columns that start with posterior to start with p instead (shorter). I learned about this today (having wondered whether it existed or not), and it took about three goes for me to get it right.13 The first column is the actual parent; the other five columns are: the posterior probabilities from before, for father and for mother (two columns), and the posterior probabilities from cross-validation for father and for mother (two more columns), and the LD1 scores from before, sorted into order. You might have these the other way around from me, but in any case you ought to make it clear which is which. I included the LD1 scores for my discussion below; you don’t need to. Are the two sets of posterior probabilities similar? Only kinda. The ones at the top and bottom of the list are without doubt respectively fathers at the top of the list (top 5 rows on my sorted output, except that one of those is actually a mother), or mothers at the bottom, from row 10 down. But for rows 6 through 9, the posterior probabilities are not that similar. The most dissimilar ones are in row 4, where the regular lda gives a posterior probability of 0.050 that the parent is a father, but under cross-validation that goes all the way up to 0.236. I think this is one of those mothers that is a bit like a father: her score on q2 was only 2, compared to 3 for most of the mothers. If you take out this mother, as cross-validation does, there are noticeably fewer q2=2 mothers left, so the observation looks more like a father than it would otherwise. \(\blacksquare$$

1. Row 17 of your (original) data frame above, row 5 of the output in the previous part, is the mother that was misclassified as a father. Why is it that the cross-validated posterior probabilities are 1 and 0, while the previous posterior probabilities are a bit less than 1 and a bit more than 0?

Solution

In making the classification, the non-cross-validated procedure uses all the data, so that parent #17 suggests that the mothers are very variable on q2, so it is conceivable (though still unlikely) that this parent actually is a mother. Under cross-validation, however, parent #17 is omitted. This mother is nothing like any of the other mothers, or, to put it another way, the remaining mothers as a group are very far away from this one, so #17 doesn’t look like a mother at all.

$$\blacksquare$$

1. Find the parents where the cross-validated posterior probability of being a father is “non-trivial”: that is, not close to zero and not close to 1. (You will have to make a judgement about what “close to zero or 1” means for you.) What do these parents have in common, all of them or most of them?

Solution

Let’s add something to the output we had before: the original scores on q1 through q4:

all %>%
select(q1:q4, parent, starts_with("posterior"), LD1) %>%
mutate(across(starts_with("posterior"), $$x) round(x, 3))) %>% rename_with( \(y) str_replace(y, "posterior", "p"), starts_with("posterior")) %>% arrange(LD1) To my mind, the “non-trivial” posterior probabilities are in rows 5 through 9. (You might have drawn the line in a different place.) These are the ones where there was some doubt, though maybe only a little, about which parent actually gave the ratings. For three of these, the parent (that was actually a mother) gave a rating of 2 on q2. These were the only 2’s on q2. The others were easy to call: “mother” if 3 and “father” if 1, and you’d get them all right except for that outlying mother. The clue in looking at q2 was that we found earlier that LD1 contained mostly q2, so that it was mainly q2 that separated the fathers and mothers. If you found something else that the “non-trivial” rows had in common, that is good too, but I think looking at q2 is your quickest route to an answer. (q1=1 picks out some of these, but not all of them.) This is really the same kind of issue as we discussed when comparing the posterior probabilities for lda and cross-validation above: there were only a few parents with q2=2, so the effect there is that under cross-validation, there are even fewer when you take one of them out. \(\blacksquare$$

## 35.13 Growing corn

A new type of corn seed has been developed. The people developing it want to know if the type of soil the seed is planted in has an impact on how well the seed performs, and if so, what kind of impact. Three outcome measures were used: the yield of corn produced (from a fixed amount of seed), the amount of water needed, and the amount of herbicide needed. The data are in link. 32 fields were planted with the seed, 8 fields with each soil type.

1. Read in the data and verify that you have 32 observations with the correct variables.

Solution

The usual:

my_url <- "https://raw.githubusercontent.com/nxskok/datafiles/master/cornseed.csv"
cornseed <- read_csv(my_url)
Rows: 32 Columns: 5
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (1): soil
dbl (4): field, yield, water, herbicide

ℹ Use spec() to retrieve the full column specification for this data.
ℹ Specify the column types or set show_col_types = FALSE to quiet this message.
cornseed

We have 32 rows; we have a categorical soil type, three numerical columns containing the yield, water and herbicide values, and we also have a label for each of the 32 fields (which is actually a number, but we don’t have to worry about that, since we won’t be using field for anything).

$$\blacksquare$$

1. Run a multivariate analysis of variance to see whether the type of soil has any effect on any of the variables. What do you conclude from it?

Solution

The usual thing: create the response, use manova (or Manova from car if you like, but it’s not necessary):

response <- with(cornseed, cbind(yield, water, herbicide))
cornseed.1 <- manova(response ~ soil, data = cornseed)
summary(cornseed.1)
          Df Pillai approx F num Df den Df  Pr(>F)
soil       3 0.5345   2.0234      9     84 0.04641 *
Residuals 28
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

With a P-value (just) less than 0.05, soil type has some effect on the response variables: that is, it affects one or more of the three responses, or some combination of them. ANOVA conclusions are usually vague, and MANOVA conclusions are vaguer than most. We will try to improve on this. But with an only-just-significant P-value, we should not be expecting miracles.

We ought to check Box’s M test:

summary(BoxM(response, cornseed$soil))  Box's M Test Chi-Squared Value = 12.07432 , df = 18 and p-value: 0.843  No problems with unequal variances or correlations, at least. Here and below, field is neither a response variable nor an explanatory variable; it is an experimental unit, so field acts as an ID rather than anything else. So field should not be part of any of the analyses; if it did appear, the only way it could is as a factor, for example if this was somehow a repeated measures analysis over the three response variables. In that case, lmer, if you were going that way, would use field as a random effect. The variables to include are the yield, water and herbicide as measured response variables, and soil type, as the categorical explanatory variable. (For the discriminant analysis, these get turned around: the grouping variable soil acts like a response and the others act as explanatory.) $$\blacksquare$$ 1. Run a discriminant analysis on these data, “predicting” soil type from the three response variables. Display the results. Solution cornseed.2 <- lda(soil ~ yield + water + herbicide, data = cornseed) cornseed.2 Call: lda(soil ~ yield + water + herbicide, data = cornseed) Prior probabilities of groups: clay loam salty sandy 0.25 0.25 0.25 0.25 Group means: yield water herbicide clay 58.8375 33.0875 4.0875 loam 69.7125 32.7375 6.3875 salty 55.3125 30.6375 3.8625 sandy 62.5750 28.2000 4.3500 Coefficients of linear discriminants: LD1 LD2 LD3 yield -0.08074845 -0.02081174 -0.04822432 water 0.03759961 0.09598577 -0.03231897 herbicide -0.50654017 0.06979662 0.27281743 Proportion of trace: LD1 LD2 LD3 0.9487 0.0456 0.0057  No field in here, for reasons discussed above. (I’m not even sure how you can run a discriminant analysis with a factor variable on the right of the squiggle.) The fields were numbered by soil type: cornseed %>% select(field, soil) so evidently if you know the field number you can guess the field type, but we didn’t care about that: we cared about whether you can distinguish the fields by yield, water, herbicide or combination thereof. $$\blacksquare$$ 1. * Which linear discriminants seem to be worth paying attention to? Why did you get three linear discriminants? Explain briefly. Solution Look for “proportion of trace” in the output. The first one is way bigger than the others, which says that the first linear discriminant is way more important (at separating the groups) than either of the other two. As to why we got three: there are 3 variables and 4 groups (soil types), and the smaller of 3 and $$4-1$$ is 3. $$\blacksquare$$ 1. Which response variables do the important linear discriminants depend on? Answer this by extracting something from your discriminant analysis output. Solution The table “coefficients of linear discriminants”. We said earlier that the only important discriminant is LD1. On that, the only notably non-zero coefficient is for herbicide; the ones for yield and water are close to zero. That is to say, the effects of the soil types play out through herbicide and not either of the other two variables. I didn’t ask you to, but you could check this by seeing how herbicide differs according to soil type: ggplot(cornseed, aes(x = soil, y = herbicide)) + geom_boxplot() The fields in loam soil needed more herbicide than the others. Or by water: ggplot(cornseed, aes(x = soil, y = water)) + geom_boxplot() There isn’t much difference in the amount of water needed between any of the fields, no matter what soil type. This confirms that water is not distinguished by soil type, while herbicide is (at least to some extent). $$\blacksquare$$ 1. Obtain predictions for the discriminant analysis. (You don’t need to do anything with them yet.) Solution Just this, therefore: cornseed.pred <- predict(cornseed.2) $$\blacksquare$$ 1. Plot the first two discriminant scores against each other, coloured by soil type. You’ll have to start by making a data frame containing what you need. Solution I changed my mind from the past about how to do this. I make a big data frame out of the data and predictions (with cbind) and go from there: d <- cbind(cornseed, cornseed.pred) head(d) Then we use this as input to ggplot: ggplot(d, aes(x = x.LD1, y = x.LD2, colour = soil)) + geom_point() $$\blacksquare$$ 1. On your plot that you just made, explain briefly how LD1 distinguishes at least one of the soil types. Solution Find a soil type that is typically high (or low or average) on LD1. Any one or more of these will do: loam soils are typically high on LD1, clay soils or salty soils are typically low on LD1; sandy soils are typically average on LD1. (There are exceptions, but I’m looking for “typically”.) $$\blacksquare$$ 1. On your plot, does LD2 appear to do anything to separate the groups? Is this surprising given your earlier findings? Explain briefly. Solution All the soil groups appear go to about the full height of the plot: that is to say, none of the groups appear to be especially at the top or the bottom. That means that LD2 does not separate the groups at all. Back in part (here), we said that the first linear discriminant is way more important than either of the other two, and here we see what that means: LD2 does nothing to separate the groups. So it’s not a surprising finding at all. I thought earlier about asking you to plot only the first linear discriminant, and now we see why: only the first one separates the groups. If you wanted to do that, you could make a boxplot of the discriminant scores by soil group, thus: ggplot(d, aes(x = soil, y = x.LD1)) + geom_boxplot() This says more or less the same thing as your plot of LD1 and LD2: loam has the highest LD1 score, sandy is about in the middle, and clay and salty have typically negative LD1 scores, similar to each other, though there is one outlying salty that looks a lot more like a loam. $$\blacksquare$$ 1. Make a table of actual and predicted soil group. Which soil type was classified correctly the most often? Solution with(d, table(obs = soil, pred = class))  pred obs clay loam salty sandy clay 3 0 3 2 loam 0 6 0 2 salty 1 1 5 1 sandy 2 1 1 4 Or, the tidyverse way, which is below. There were 8 fields of each soil type. The soil type that has the most of its fields classified correctly (based on the values of the response variables) has the biggest number down the diagonal of the table: looking at 3, 6, 5 and 4, we see that the loam soil type had the most of its fields classified correctly, so this was the most distinct from the others. (We also saw this on the plot of LD1 vs. LD2: the loam fields were all over on the right.) This was easier because we had the same number of fields of each type. If we didn’t have that, the right way to go then would be to work out row percentages: “out of the fields that were actually sandy, what percent of them got classified as sandy”, and so on. This is not a perfect classification, though, which is about what you would expect from the soil types being intermingled on the plot of LD1 vs. LD2. If you look at the table, salty and sandy are fairly distinct also, but clay is often confused with both of them. On the plot of LD1 and LD2, salty is generally to the left of sandy, but clay is mixed up with them both. The tidyverse way of doing this is equally good. This is the tidied-up way: d %>% count(soil, class) %>% pivot_wider(names_from = class, values_from = n, values_fill = 0) Six out of eight loams were correctly classified, which is better than anything else. Extra: we can calculate misclassification rates, first overall, which is easier: d %>% count(soil, class) %>% mutate(soil_stat = ifelse(soil == class, "correct", "wrong")) %>% count(soil_stat, wt = n) d %>% count(soil, class) %>% mutate(soil_stat = ifelse(soil == class, "correct", "wrong")) %>% count(soil_stat, wt = n) %>% mutate(prop = nn / sum(nn)) Error in mutate(): ℹ In argument: prop = nn/sum(nn). Caused by error: ! object 'nn' not found Note the use of wt on the second count to count the number of observations from the first count, not the number of rows. This shows that 44% of the soil types were misclassified, which sounds awful, but is actually not so bad, considering. Bear in mind that if you were just guessing, you’d get 75% of them wrong, so getting 44% wrong is quite a bit better than that. The variables (especially herbicide) are at least somewhat informative about soil type; it’s better to know them than not to. Or do it by actual soil type: d %>% count(soil, class) %>% group_by(soil) %>% mutate(soil_stat = ifelse(soil == class, "correct", "wrong")) %>% count(soil_stat, wt = n) d %>% count(soil, class) %>% group_by(soil) %>% mutate(soil_stat = ifelse(soil == class, "correct", "wrong")) %>% count(soil_stat, wt = n) %>% mutate(prop = n / sum(n)) %>% select(-n) %>% pivot_wider(names_from=soil_stat, values_from=prop) Loam soil was the easiest to get right, and clay was easiest to get wrong. However, these proportions were each based on only eight observations, so it’s probably wise not to say that loam is always easiest to get right. I didn’t have you look at posterior probabilities here.14 With 32 fields, this is rather a lot to list them all, but what we can do is to look at the ones that were misclassified (the true soil type differs from the predicted soil type). Before that, though, we need to make a data frame with the stuff in it that we want to look at. And before that, I want to round the posterior probabilities to a small number of decimals. Then, we can fire away with this: d %>% mutate(across(starts_with("posterior"), $$post) round(post, 3))) %>% mutate(row = row_number()) -> dd dd %>% filter(soil != class) Most of the posterior probabilities are neither especially small nor especially large, which adds to the impression that things are really rather uncertain. For example, field 8 could have been either loam (0.352) or sandy (0.373). There was one field that was actually salty but looked like a loam one (with LD1 score around 2); this is field 20, that needed a lot of herbicide; it was rated to have an 82% chance of being loam and only 1% chance of salty. Let’s remind ourselves of why we were doing this: the MANOVA was significant, so at least some of the fields were different on some of the variables from some of the others. What we found by doing the discriminant analysis was that only the first discriminant was of any value in distinguishing the soil types by the variables we measured, and that was mostly herbicide. So the major effect that soil type had was on the amount of herbicide needed, with the loam soils needing most. I wanted to finish with one more thing, which was to look again at the soils that were actually loam: dd %>% filter(soil == "loam") %>% select(soil, yield, water, herbicide, class, starts_with("posterior")) Fields 7 and 8 could have been pretty much any type of soil; sandy came out with the highest posterior probability, so that’s what they were predicted (wrongly) to be. Some of the fields, 1, 3 and 5, were clearly (and correctly) loam. For 1 and 5, you can clearly see that this is because herbicide was high, but field 3 is more of a mystery. For this field, herbicide is not high, so one or more of the other variables must be pointing towards loam. We can obtain predicted LD1 scores for various combinations of “typical” values of the response variables and see what has what effect on LD1: summary(cornseed)  field soil yield water Min. : 1.00 Length:32 Min. :45.00 Min. :14.50 1st Qu.: 8.75 Class :character 1st Qu.:50.58 1st Qu.:25.75 Median :16.50 Mode :character Median :61.40 Median :29.60 Mean :16.50 Mean :61.61 Mean :31.17 3rd Qu.:24.25 3rd Qu.:67.00 3rd Qu.:36.83 Max. :32.00 Max. :96.10 Max. :54.20 herbicide Min. : 1.100 1st Qu.: 3.075 Median : 4.750 Mean : 4.672 3rd Qu.: 5.825 Max. :11.700  The problem is that the variables have different spreads. Let’s do some predictions (ie. calculations) of LD1 score for combinations of quartiles of our response variables. I like quartiles because these are “representative” values of the variables, typical of how far up and down they go. This process is one you’ve seen before: yields <- c(51, 67) waters <- c(26, 37) herbicides <- c(3, 6) new <- datagrid(model = cornseed.2, yield = yields, water = waters, herbicide = herbicides) pred <- predict(cornseed.2, new) cbind(new, predx) %>% arrange(desc(LD1)) I arranged the predicted LD1 scores in descending order, so the most loam-like combinations are at the top. The top two combinations look like loam; they both have high herbicide, as we figured before. But they also have high yield. That might go some way towards explaining why field 3, with its non-high herbicide, was confidently predicted to be loam: cornseed %>% filter(field == 3) This has a very high yield, and that is what is making us (correctly) think it is loam. I suddenly remembered that I hadn’t done a biplot of this one, which I could, since it’s a discriminant analysis: ggbiplot(cornseed.2, groups = cornseedsoil) This shows the dominant influence of herbicide on LD1 score (more herbicide is more positive), and that water has nothing to say (in terms of distinguishing soil types) and yield has not much to say, their arrows being short. That observation with a non-high herbicide that was predicted to be had the highest yield of all, so even the small influence of yield on LD1 made a big difference here. \(\blacksquare$$ ## 35.14 Understanding athletes’ height, weight, sport and gender On a previous assignment, we used MANOVA on the athletes data to demonstrate that there was a significant relationship between the combination of the athletes’ height and weight, with the sport they play and the athlete’s gender. The problem with MANOVA is that it doesn’t give any information about the kind of relationship. To understand that, we need to do discriminant analysis, which is the purpose of this question. The data can be found at link. 1. Once again, read in and display (some of) the data, bearing in mind that the data values are separated by tabs. (This ought to be a free two marks.) Solution Nothing new here: my_url <- "https://raw.githubusercontent.com/nxskok/datafiles/master/ais.txt" athletes <- read_tsv(my_url) Rows: 202 Columns: 13 ── Column specification ──────────────────────────────────────────────────────── Delimiter: "\t" chr (2): Sex, Sport dbl (11): RCC, WCC, Hc, Hg, Ferr, BMI, SSF, %Bfat, LBM, Ht, Wt ℹ Use spec() to retrieve the full column specification for this data. ℹ Specify the column types or set show_col_types = FALSE to quiet this message. athletes $$\blacksquare$$ 1. Use unite to make a new column in your data frame which contains the sport-gender combination. Display it. (You might like to display only a few columns so that it is clear that you did the right thing.) Hint: you’ve seen unite in the peanuts example in class. Solution The columns to combine are called Sport and Sex, with Capital Letters. The syntax for unite is that you give the name of the new combo column first, and then the names of the columns you want to combine, either by listing them or by using a select-helper. They will be separated by an underscore by default, which is usually easiest to handle.15 In unite, you can group the columns to “unite” with c(), as in class, or not, as here. Either way is good.16 We’ll be using height and weight in the analysis to come, so I decided to display just those: athletes %>% unite(combo, Sport, Sex) -> athletesc athletesc %>% select(combo, Ht, Wt) I gave the data frame a new name, since I might want to come back to the original later. Also, displaying only those columns gives more width for the display of my combo, so that I can be sure I got it right. Extra: there is another column, SSF, that begins with S, so the select-helper thing is not so obviously helpful here. But the two columns we want start with S followed by either e or p, so we could do this: athletes %>% unite(combo, matches("^S(e|p)")) %>% select(combo, Ht, Wt) The matches takes a so-called regular expression. This one says starting at the beginning of the column name, find an uppercase S followed by either a lowercase e or a lowercase p’’. This picks out the columns and only the columns we want. In the opposite order, though (either order is fine). I have a feeling we can also take advantage of the fact that the two columns we want to unite are the only two text ones: athletes %>% unite(combo, where(is.character)) I wasn’t expecting that to work! $$\blacksquare$$ 1. Run a discriminant analysis “predicting” sport-gender combo from height and weight. Display the results. (No comment needed yet.) Solution That would be this. I’m having my familiar trouble with names: combo.1 <- lda(combo ~ Ht + Wt, data = athletesc) If you used a new name for the data frame with the sport-gender combinations in it, use that new name here. The output: combo.1 Call: lda(combo ~ Ht + Wt, data = athletesc) Prior probabilities of groups: BBall_female BBall_male Field_female Field_male Gym_female 0.06435644 0.05940594 0.03465347 0.05940594 0.01980198 Netball_female Row_female Row_male Swim_female Swim_male 0.11386139 0.10891089 0.07425743 0.04455446 0.06435644 T400m_female T400m_male Tennis_female Tennis_male TSprnt_female 0.05445545 0.08910891 0.03465347 0.01980198 0.01980198 TSprnt_male WPolo_male 0.05445545 0.08415842 Group means: Ht Wt BBall_female 182.2692 71.33077 BBall_male 195.5833 88.92500 Field_female 172.5857 80.04286 Field_male 185.2750 95.76250 Gym_female 153.4250 43.62500 Netball_female 176.0870 69.59348 Row_female 178.8591 72.90000 Row_male 187.5333 86.80667 Swim_female 173.1778 65.73333 Swim_male 185.6462 81.66154 T400m_female 169.3364 57.23636 T400m_male 179.1889 68.20833 Tennis_female 168.5714 58.22857 Tennis_male 183.9500 75.40000 TSprnt_female 170.4750 59.72500 TSprnt_male 178.5364 75.79091 WPolo_male 188.2235 86.72941 Coefficients of linear discriminants: LD1 LD2 Ht 0.08898971 -0.1888615 Wt 0.06825230 0.1305246 Proportion of trace: LD1 LD2 0.7877 0.2123  I comment here that there are two linear discriminants because there are two variables (height and weight) and actually 17 groups (not quite $$2\times 10$$ because some sports are played by athletes of only one gender). The smaller of 2 and $$17-1$$ is 2. (I often ask about this, but am choosing not to here.) $$\blacksquare$$ 1. What kind of height and weight would make an athlete have a large (positive) score on LD1? Explain briefly. Solution The Coefficients of Linear Discriminants for LD1 are both positive, so an athlete with a large positive score on LD1 has a large height and weight: that is to say, they are tall and heavy. $$\blacksquare$$ 1. Make a guess at the sport-gender combination that has the highest score on LD1. Why did you choose the combination you did? Solution I could have made you guess the smallest score on LD1, but that would have been too easy (female gymnasts). For this one, you want a sport-gender combination that is typically tall and heavy, and you can look in the table of Group Means to help you find a candidate group. I think the two best guesses are male basketball players (tallest and nearly the heaviest) and male field athletes (heaviest and among the group of athletes that are second-tallest behind the male basketball players). I don’t so much mind what you guess, as long as you make a sensible call about a group that is reasonably tall and reasonably heavy (or, I suppose, that matches with what you said in the previous part, whatever that was). $$\blacksquare$$ 1. What combination of height and weight would make an athlete have a small* (that is, very negative) score on LD2? Explain briefly. Solution The italics in the question are something to do with questions that have a link to them in Bookdown. I don’t know how to fix that. Going back to the Coefficients of Linear Discriminants, the coefficient for Height is negative, and the one for Weight is positive. What will make an athlete come out small (very negative) on this is if they have a large height and a small weight. To clarify your thinking on this, think of the heights and weights as being standardized, so that a big one will be positive and a small one will be negative. To make LD2 very negative, you want a “plus” height to multiply the minus sign, and a “minus” weight multiplying the plus sign. Extra: what is happening here is that LD1 gives the most important way in which the groups differ, and LD2 the next-most important. There is generally a positive correlation between height and weight (taller athletes are generally heavier), so the most important “dimension” is the big-small one with tall heavy athletes at one end and short light athletes at the other. The Proportion of trace in the output says that LD1 is definitely more important, in terms of separating the groups, than LD2 is, but the latter still has some value. $$\blacksquare$$ 1. Obtain predictions for the discriminant analysis, and use these to make a plot of LD1 score against LD2 score, with the individual athletes distinguished by what sport they play and gender they are. (You can use colour to distinguish them, or you can use shapes. If you want to go the latter way, there are clues in my solutions to the MANOVA question about these athletes.) Solution The prediction part is only one step: p <- predict(combo.1) One point for this. This, in case you are wondering, is obtaining predicted group membership and LD scores for the original data, that is, for our 202 athletes. I prefer (no obligation) to take a look at what I have. My p is actually a list: class(p) [1] "list" glimpse(p) List of 3$ class    : Factor w/ 17 levels "BBall_female",..: 12 6 6 6 7 6 6 6 6 6 ...
$posterior: num [1:202, 1:17] 0.1235 0.0493 0.084 0.0282 0.1538 ... ..- attr(*, "dimnames")=List of 2 .. ..$ : chr [1:202] "1" "2" "3" "4" ...
.. ..$: chr [1:17] "BBall_female" "BBall_male" "Field_female" "Field_male" ...$ x        : num [1:202, 1:2] -1.325 -1.487 -0.96 -1.885 0.114 ...
..- attr(*, "dimnames")=List of 2
.. ..$: chr [1:202] "1" "2" "3" "4" ... .. ..$ : chr [1:2] "LD1" "LD2"

Our standard procedure is to cbind the predictions together with the original data (including the combo), and get a huge data frame (in this case):

d <- cbind(athletesc, p)
head(d)

And so, to the graph:

ggplot(d, aes(x = x.LD1, y = x.LD2, colour = combo)) + geom_point()

If you can distinguish seventeen different colours, your eyes are better than mine! You might prefer to use seventeen different shapes, although I wonder how much better that will be:

ggplot(d, aes(x = x.LD1, y = x.LD2, shape = combo)) + geom_point() +
scale_shape_manual(values = 1:17)

You have to do something special to get as many as seventeen shapes. This idea came from the MANOVA question in the last assignment.

Or even this:

ggplot(d, aes(x = x.LD1, y = x.LD2, shape = combo, colour = combo)) + geom_point() +
scale_shape_manual(values = 1:17)

Perhaps having colours and shapes makes the combos easier to distinguish. We’re beginning to stray onto the boundary between statistics and aesthetics here!

Extra: earlier, I asked you to guess which group(s) of athletes had a high (positive) score on LD1. These are the ones on the right side of this plot: male basketball players bottom right and male field athletes top right. Was that what you guessed? What about the other guesses you might have made?

$$\blacksquare$$

1. Look on your graph for the four athletes with the smallest (most negative) scores on LD2. What do they have in common? Does this make sense, given your answer to part (here)? Explain briefly.

Solution

These are the four athletes at the bottom of the plot. If you can distinguish the colours, two of these are red and two of them are orange, so they are all basketball players (two male and two female). If you plotted the shapes, and you used the same shapes I did, two of them are circles and the other two are upward-facing triangles, leading you to the same conclusion. (You could also denote each combo by a letter and plot with those letters, as per the solutions to the last assignment.) Back in part (here), I said that what would make an athlete come out very negative on LD2 is if they were tall and not heavy. This is the stereotypical description of a basketball player, so it makes perfect sense to me.

Extra: some basketball players are tall and heavier; these are the ones on the right of the plot, with a larger LD1 score, to reflect that they are both tall and heavy, but with an LD2 score closer to zero, reflecting that, given how tall they are, their weight is about what you’d expect. LD2 is really saying something like “weight relative to height”, with someone at the top of the picture being unusually heavy and someone at the bottom unusually light.

$$\blacksquare$$

1. Obtain a (very large) square table, or a (very long) table with frequencies, of actual and predicted sport-gender combinations. You will probably have to make the square table very small to fit it on the page. For that, displaying the columns in two or more sets is OK (for example, six columns and all the rows, six more columns and all the rows, then the last five columns for all the rows). Are there any sport-gender combinations that seem relatively easy to classify correctly? Explain briefly.

Solution

Let’s see what happens:

tab <- with(d, table(combo, class))
tab
                class
combo            BBall_female BBall_male Field_female Field_male Gym_female
BBall_female              3          1            0          0          0
BBall_male                0          9            0          0          0
Field_female              0          0            5          0          0
Field_male                0          1            0          7          0
Gym_female                0          0            0          0          4
Netball_female            0          0            1          0          0
Row_female                0          0            0          0          1
Row_male                  0          2            0          1          0
Swim_female               0          0            0          0          0
Swim_male                 0          4            0          0          0
T400m_female              0          0            0          0          0
T400m_male                3          1            0          0          0
Tennis_female             0          0            1          0          1
Tennis_male               1          0            0          0          0
TSprnt_female             0          0            0          0          0
TSprnt_male               0          0            0          0          0
WPolo_male                1          3            0          2          0
class
combo            Netball_female Row_female Row_male Swim_female Swim_male
BBall_female                5          1        0           0         0
BBall_male                  0          0        0           0         2
Field_female                1          0        0           0         0
Field_male                  0          2        0           0         0
Gym_female                  0          0        0           0         0
Netball_female             13          4        0           0         0
Row_female                  5         10        0           0         1
Row_male                    0          0        1           0         0
Swim_female                 4          1        0           0         0
Swim_male                   2          3        0           0         0
T400m_female                3          0        0           0         0
T400m_male                  5          3        0           0         0
Tennis_female               2          0        0           0         0
Tennis_male                 0          3        0           0         0
TSprnt_female               1          0        0           0         0
TSprnt_male                 6          3        0           0         0
WPolo_male                  0          3        1           0         0
class
combo            T400m_female T400m_male Tennis_female Tennis_male
BBall_female              0          2             0           0
BBall_male                0          0             0           0
Field_female              1          0             0           0
Field_male                0          0             0           0
Gym_female                0          0             0           0
Netball_female            1          4             0           0
Row_female                0          4             0           0
Row_male                  1          0             0           0
Swim_female               3          1             0           0
Swim_male                 0          1             0           0
T400m_female              6          2             0           0
T400m_male                1          5             0           0
Tennis_female             2          1             0           0
Tennis_male               0          0             0           0
TSprnt_female             2          1             0           0
TSprnt_male               0          0             0           0
WPolo_male                0          0             0           0
class
combo            TSprnt_female TSprnt_male WPolo_male
BBall_female               0           0          1
BBall_male                 0           0          1
Field_female               0           0          0
Field_male                 0           0          2
Gym_female                 0           0          0
Netball_female             0           0          0
Row_female                 0           0          1
Row_male                   0           0         10
Swim_female                0           0          0
Swim_male                  0           0          3
T400m_female               0           0          0
T400m_male                 0           0          0
Tennis_female              0           0          0
Tennis_male                0           0          0
TSprnt_female              0           0          0
TSprnt_male                0           0          2
WPolo_male                 0           0          7

That’s kind of long.

For combos that are easy to classify, you’re looking for a largish number on the diagonal of the table (classified correctly), bearing in mind that you only see about four columns of the table at once, and (much) smaller numbers in the rest of the row and column. I don’t mind which ones you pick out, but see if you can find a few:

• Male basketball players (9 out of 12 classified correctly)

• Male field athletes (7 out of 10 classified correctly)

• Female netball players (13 out of about 23)

• Female rowers (10 out of about 22)

Or you can turn it into a tibble:

tab %>% as_tibble()

This makes the tidyverse output, with frequencies. You probably want to omit the zero ones:

tab %>% as_tibble() %>% filter(n > 0)

This is the same output as below. See there for comments.

The other, perhaps easier, way to tackle this one is the tidyverse way, making a “long” table of frequencies. Here is some of it. You’ll be able to click to see more:

d %>% count(combo, class)

The zeroes never show up here. The combo column is the truth, and the class column is the prediction. Again, you can see where the big frequencies are; a lot of the female netball players were gotten right, but there were a lot of them to begin with.

Extra: let’s see if we can work out proportions correct. I’ve changed my mind from how I originally wrote this. I still use count, but I start with the overall misclassification. Let’s take it in steps:

d %>%
count(combo, class) %>%
mutate(stat = ifelse(combo == class, "correct", "wrong"))

That makes a new column stat that contains whether the predicted sport-gender combination was correct or wrong. For an overall misclassification rate we have to count these, but not simply counting the number of rows; rather, we need to total up the things in the n column:

d %>%
count(combo, class) %>%
mutate(stat = ifelse(combo == class, "correct", "wrong")) %>%
count(stat, wt = n)

This tells us how many predictions overall were right and how many wrong.

To make those into proportions, another mutate, dividing by the total of n:

d %>%
count(combo, class) %>%
mutate(stat = ifelse(combo == class, "correct", "wrong")) %>%
count(stat, wt = n) %>%
mutate(proportion = n / sum(n))

65% of the sport-gender combinations were misclassified. This is awful, but is a lot better than guessing (we’d then get about 5% of them right and about 95% wrong).

There’s a subtlety here that will make sense when we do the corresponding calculation by sport-gender combination. To do that, we put a group_by(combo) either before or after we define stat (it doesn’t matter which way):

d %>%
count(combo, class) %>%
group_by(combo) %>%
mutate(stat = ifelse(combo == class, "correct", "wrong")) %>%
count(stat, wt = n) %>%
mutate(proportion = n / sum(n))

That last sum(n): what is it summing over? The answer is “within combo”, since that is the group_by. You see that the two proportion values within, say, BBall_female, add up to 1.

We don’t actually see all the answers, because there are too many of them. Let’s try to get the proportion correct and wrong in their own columns. This almost works:

d %>%
count(combo, class) %>%
group_by(combo) %>%
mutate(stat = ifelse(combo == class, "correct", "wrong")) %>%
count(stat, wt = n) %>%
mutate(proportion = n / sum(n)) %>%
pivot_wider(names_from=stat, values_from=proportion)

This doesn’t work because everything outside of the pivot_wider is tested for uniqueness; if it’s unique, it gets its own row. Thus, BBall_male and 3 is different from BBall_male and 9. But we only want one row of BBall_male. I think the easiest way around this is to get rid of n, since it has served its purpose:

d %>%
count(combo, class) %>%
group_by(combo) %>%
mutate(stat = ifelse(combo == class, "correct", "wrong")) %>%
count(stat, wt = n) %>%
mutate(proportion = n / sum(n)) %>%
select(-n) %>%
pivot_wider(names_from=stat, values_from=proportion, values_fill = list(proportion=0))

One extra thing: some of the proportion values were missing, because there weren’t any misclassified (or maybe correctly-classified!) athletes. The values_fill sets any missings in proportion to zero.

While we’re about it, let’s arrange in order of misclassification probability:

d %>%
count(combo, class) %>%
group_by(combo) %>%
mutate(stat = ifelse(combo == class, "correct", "wrong")) %>%
count(stat, wt = n) %>%
mutate(proportion = n / sum(n)) %>%
select(-n) %>%
pivot_wider(names_from=stat, values_from=proportion, values_fill = list(proportion=0)) %>%
replace_na(list(correct = 0, wrong = 0)) %>%
arrange(wrong)

The most distinctive athletes were the female gymnasts (tiny!), followed by the male basketball players (tall) and the female field athletes (heavy). These were easiest to predict from their height and weight. The ones at the bottom of the list were very confusible since the discriminant analysis guessed them all wrong! So what were the most common misclassifications? Let’s go back to this:

head(d)
d %>%
count(combo, class) %>%
mutate(stat = ifelse(combo == class, "correct", "wrong"))

We want to express those n values as proportions out of their actual sport-gender combo, so we group by combo before defining the proportions:

d %>%
count(combo, class) %>%
mutate(stat = ifelse(combo == class, "correct", "wrong")) %>%
group_by(combo) %>%
mutate(proportion = n / sum(n))

Only pick out the ones that were gotten wrong, and arrange the remaining proportions in descending order:

d %>%
count(combo, class) %>%
mutate(stat = ifelse(combo == class, "correct", "wrong")) %>%
group_by(combo) %>%
mutate(proportion = n / sum(n)) %>%
filter(stat == "wrong") %>%
arrange(desc(proportion))

The embarrassment champion is the three male tennis players that were taken to be — female rowers! Most of the other mistakes are more forgivable: the male rowers being taken for male water polo players, for example.

$$\blacksquare$$

1. Grammatically, I am supposed to write this as “two hundred and forty-four” in words, since I am not supposed to start a sentence with a number. But, I say, deal with it. Or, I suppose, “there are 244 people who work…”.↩︎

2. If you run into an error like “Argument 2 must have names” here, that means that the second thing, class, needs to have a name and doesn’t have one.↩︎

3. This was why we were doing discriminant analysis in the first place.↩︎

4. That sounds like the ultimate in evasiveness!↩︎

5. Every time I do this, I forget to put the summary around the outside, so I get the long ugly version of the output rather than the short pretty version.↩︎

6. For suitable definitions of fun.↩︎

7. Grammatically, I am supposed to write this as “two hundred and forty-four” in words, since I am not supposed to start a sentence with a number. But, I say, deal with it. Or, I suppose, “there are 244 people who work…”.↩︎

8. Until much later.↩︎

9. When you’re recording the data, you may find it convenient to use short codes to represent the possibly long factor levels, but in that case you should also use a codebook so that you know what the codes represent. When I read the data into R, I would create a factor with named levels, like I did here, if I don’t already have one.↩︎

10. I discovered that I used pp twice, and I want to use the first one again later, so I had to rename this one.↩︎

11. So we should do it, when assessing how good the classification is.↩︎

12. I have to learn to come up with better names.↩︎

13. str-replace is from stringr, and takes three inputs: a piece of text, the text to look for, and the text to replace it with. The piece of text in this case is one of the columns whose name starts with posterior; the dot represents it, that is to say “one of the columns whose name starts with posterior.↩︎

14. Rest assured that I did on the final exam!↩︎

15. The opposite of unite is separate, which splits a combined column like my combo into separate columns; it too uses underscore as the default separator.↩︎

16. You used to have to group them, but you don’t any more. Hence my old code has them grouped, but my new code does not.↩︎