`library(tidyverse)`

# 27 Logistic regression

## 27.1 Finding wolf spiders on the beach

A team of Japanese researchers were investigating what would cause the burrowing wolf spider *Lycosa ishikariana* to be found on a beach. They hypothesized that it had to do with the size of the grains of sand on the beach. They went to 28 beaches in Japan, measured the average sand grain size (in millimetres), and observed the presence or absence of this particular spider on each beach. The data are in link.

Why would logistic regression be better than regular regression in this case?

Read in the data and check that you have something sensible. (Look at the data file first: the columns are aligned.)

Make a boxplot of sand grain size according to whether the spider is present or absent. Does this suggest that sand grain size has something to do with presence or absence of the spider?

Fit a logistic regression predicting the presence or absence of spiders from the grain size, and obtain its

`summary`

. Note that you will need to do something with the response variable first.Is there a significant relationship between grain size and presence or absence of spiders at the \(\alpha=0.10\) level? Explain briefly.

Obtain predicted probabilities of spider presence for a representative collection of grain sizes. I only want predicted probabilities, not any kind of intervals.

Given your previous work in this question, does the trend you see in your predicted probabilities surprise you? Explain briefly.

## 27.2 Killing aphids

An experiment was designed to examine how well the insecticide rotenone kills aphids that feed on the chrysanthemum plant called *Macrosiphoniella sanborni*. The explanatory variable is the log concentration (in milligrams per litre) of the insecticide. At each of the five different concentrations, approximately 50 insects were exposed. The number of insects exposed at each concentration, and the number killed, are shown below.

```
Log-Concentration Insects exposed Number killed
0.96 50 6
1.33 48 16
1.63 46 24
2.04 49 42
2.32 50 44
```

Get these data into R. You can do this by copying the data into a file and reading that into R (creating a data frame), or you can enter the data manually into R using

`c`

, since there are not many values. In the latter case, you can create a data frame or not, as you wish. Demonstrate that you have the right data in R.* Looking at the data, would you expect there to be a significant effect of log-concentration? Explain briefly.

We are going to do a logistic regression to predict how likely an insect is to be killed, as it depends on the log-concentration. Create a suitable response variable, bearing in mind (i) that we have lots of insects exposed to each different concentration, and (ii) what needs to go into each column of the response.

Run a suitable logistic regression, and obtain a summary of the results.

Does your analysis support your answer to (here)? Explain briefly.

Obtain predicted probabilities of an insect’s being killed at each of the log-concentrations in the data set.

People in this kind of work are often interested in the “median lethal dose”. In this case, that would be the log-concentration of the insecticide that kills half the insects. Based on your predictions, roughly what do you think the median lethal dose is?

## 27.3 The effects of Substance A

In a dose-response experiment, animals (or cell cultures or human subjects) are exposed to some toxic substance, and we observe how many of them show some sort of response. In this experiment, a mysterious Substance A is exposed at various doses to 100 cells at each dose, and the number of cells at each dose that suffer damage is recorded. The doses were 10, 20, … 70 (mg), and the number of damaged cells out of 100 were respectively 10, 28, 53, 77, 91, 98, 99.

Find a way to get these data into R, and show that you have managed to do so successfully.

Would you expect to see a significant effect of dose for these data? Explain briefly.

Fit a logistic regression modelling the probability of a cell being damaged as it depends on dose. Display the results. (Comment on them is coming later.)

Does your output indicate that the probability of damage really does

*increase*with dose? (There are two things here: is there really an effect, and which way does it go?)Obtain predicted damage probabilities for some representative doses.

Draw a graph of the predicted probabilities, and to that add the observed proportions of damage at each dose. Hints: you will have to calculate the observed proportions first. See here, near the bottom, to find out how to add data to one of these graphs. The

`geom_point`

line is the one you need.

Looking at the predicted probabilities, would you say that the model fits well? Explain briefly.

## 27.4 What makes an animal get infected?

Some animals got infected with a parasite. We are interested in whether the likelihood of infection depends on any of the age, weight and sex of the animals. The data are at link. The values are separated by tabs.

Read in the data and take a look at the first few lines. Is this one animal per line, or are several animals with the same age, weight and sex (and infection status) combined onto one line? How can you tell?

* Make suitable plots or summaries of

`infected`

against each of the other variables. (You’ll have to think about`sex`

, um, you’ll have to think about the`sex`

variable, because it too is categorical.) Anything sensible is OK here. You might like to think back to what we did in Question here for inspiration. (You can also investigate`table`

, which does cross-tabulations.)Which, if any, of your explanatory variables appear to be related to

`infected`

? Explain briefly.Fit a logistic regression predicting

`infected`

from the other three variables. Display the`summary`

.* Which variables, if any, would you consider removing from the model? Explain briefly.

Are the conclusions you drew in (here) and (here) consistent, or not? Explain briefly.

* The first and third quartiles of

`age`

are 26 and 130; the first and third quartiles of`weight`

are 9 and 16. Obtain predicted probabilities for all combinations of these and`sex`

. (You’ll need to start by making a new data frame, using`datagrid`

to get all the combinations.)

## 27.5 The brain of a cat

A large number (315) of psychology students were asked to imagine that they were serving on a university ethics committee hearing a complaint against animal research being done by a member of the faculty. The students were told that the surgery consisted of implanting a device called a cannula in each cat’s brain, through which chemicals were introduced into the brain and the cats were then given psychological tests. At the end of the study, the cats’ brains were subjected to histological analysis. The complaint asked that the researcher’s authorization to carry out the study should be withdrawn, and the cats should be handed over to the animal rights group that filed the complaint. It was suggested that the research could just as well be done with computer simulations.

All of the psychology students in the survey were told all of this. In addition, they read a statement by the researcher that no animal felt much pain at any time, and that computer simulation was *not* an adequate substitute for animal research. Each student was also given *one* of the following scenarios that explained the benefit of the research:

“cosmetic”: testing the toxicity of chemicals to be used in new lines of hair care products.

“theory”: evaluating two competing theories about the function of a particular nucleus in the brain.

“meat”: testing a synthetic growth hormone said to potentially increase meat production.

“veterinary”: attempting to find a cure for a brain disease that is killing domesticated cats and endangered species of wild cats.

“medical”: evaluating a potential cure for a debilitating disease that afflicts many young adult humans.

Finally, each student completed two questionnaires: one that would assess their “relativism”: whether or not they believe in universal moral principles (low score) or whether they believed that the appropriate action depends on the person and situation (high score). The second questionnaire assessed “idealism”: a high score reflects a belief that ethical behaviour will always lead to good consequences (and thus that if a behaviour leads to any bad consequences at all, it is unethical).^{1}

After being exposed to all of that, each student stated their decision about whether the research should continue or stop.

I should perhaps stress at this point that no actual cats were harmed in the collection of these data (which can be found as a `.csv`

file at link). The variables in the data set are these:

`decision`

: whether the research should continue or stop (response)`idealism`

: score on idealism questionnaire`relativism`

: score on relativism questionnaire`gender`

of student`scenario`

of research benefits that the student read.

A more detailed discussion^(“[If you can believe it.] of this study is at link.

Read in the data and check by looking at the structure of your data frame that you have something sensible.

*Do not*call your data frame`decision`

, since that’s the name of one of the variables in it.Fit a logistic regression predicting

`decision`

from`gender`

. Is there an effect of gender?To investigate the effect (or non-effect) of

`gender`

, create a contingency table by feeding`decision`

and`gender`

into`table`

. What does this tell you?* Is your slope for

`gender`

in the previous logistic regression positive or negative? Is it applying to males or to females? Looking at the conclusions from your contingency table, what probability does that mean your logistic regression is actually modelling?Add the two variables

`idealism`

and`relativism`

to your logistic regression. Do either or both of them add significantly to your model? Explain briefly.Add the variable

`scenario`

to your model. That is, fit a new model with that variable plus all the others.Use

`anova`

to compare the models with and without`scenario`

. You’ll have to add a`test="Chisq"`

to your`anova`

, to make sure that the test gets done. Does`scenario`

make a difference or not, at \(\alpha=0.10\)? Explain briefly. (The reason we have to do it this way is that`scenario`

is a factor with five levels, so it has four slope coefficients. To test them all at once, which is what we need to make an overall test for`scenario`

, this is the way it has to be done.)Look at the

`summary`

of your model that contained`scenario`

. Bearing in mind that the slope coefficient for`scenariocosmetic`

is zero (since this is the first scenario alphabetically), which scenarios have the most positive and most negative slope coefficients? What does that tell you about those scenarios’ effects?Describe the effects that having (i) a higher idealism score and (ii) a higher relativity score have on a person’s probability of saying that the research should stop. Do each of these increase or decrease that probability? Explain briefly.

## 27.6 How not to get heart disease

What is associated with heart disease? In a study, a large number of variables were measured, as follows:

`age`

(years)`sex`

male or female`pain.type`

Chest pain type (4 values: typical angina, atypical angina, non-anginal pain, asymptomatic)`resting.bp`

Resting blood pressure, on admission to hospital`serum.chol`

Serum cholesterol`high.blood.sugar`

: greater than 120, yes or no`electro`

resting electrocardiographic results (normal, having ST-T, hypertrophy)`max.hr`

Maximum heart rate`angina`

Exercise induced angina (yes or no)`oldpeak`

ST depression induced by exercise relative to rest. See link.`slope`

Slope of peak exercise ST segment. Sloping up, flat or sloping down`colored`

number of major vessels (0–3) coloured by fluoroscopy`thal`

normal, fixed defect, reversible defect`heart.disease`

yes, no (response)

I don’t know what most of those are, but we will not let that stand in our way. Our aim is to find out what variables are associated with heart disease, and what values of those variables give high probabilities of heart disease being present. The data are in link.

Read in the data. Display the first few lines and convince yourself that those values are reasonable.

In a logistic regression, what probability will be predicted here? Explain briefly but convincingly. (Is each line of the data file one observation or a summary of several?)

* Fit a logistic regression predicting heart disease from everything else (if you have a column called

`X`

or`X1`

, ignore that), and display the results.Quite a lot of our explanatory variables are factors. To assess whether the factor as a whole should stay or can be removed, looking at the slopes won’t help us very much (since they tell us whether the other levels of the factor differ from the baseline, which may not be a sensible comparison to make). To assess which variables are candidates to be removed, factors included (properly), we can use

`drop1`

. Feed`drop1`

a fitted model and the words`test="Chisq"`

(take care of the capitalization!) and you’ll get a list of P-values. Which variable is the one that you would remove first? Explain briefly.I’m not going to make you do the whole backward elimination (I’m going to have you use

`step`

for that later), but do one step: that is, fit a model removing the variable you think should be removed, using`update`

, and then run`drop1`

again to see which variable will be removed next.Use

`step`

to do a backward elimination to find which variables have an effect on heart disease. Display your final model (which you can do by saving the output from`step`

in a variable, and asking for the summary of that. In`step`

, you’ll need to specify a starting model (the one from part (here)), the direction of elimination, and the test to display the P-value for (the same one as you used in`drop1`

). (Note: the actual decision to keep or drop explanatory variables is based on AIC rather than the P-value, with the result that`step`

will sometimes keep variables you would have dropped, with P-values around 0.10.)Display the summary of the model that came out of

`step`

.We are going to make a large number of predictions. Create and save a data frame that contains predictions for all combinations of representative values for all the variables in the model that came out of

`step`

. By “representative” I mean all the values for a categorical variable, and the five-number summary for a numeric variable. (Note that you will get a*lot*of predictions.)Find the largest predicted probability (which is the predicted probability of heart disease) and display all the variables that it was a prediction for.

Compare the

`summary`

of the final model from`step`

with your highest predicted heart disease probability and the values of the other variables that make it up. Are they consistent?

## 27.7 Successful breastfeeding

A regular pregnancy lasts 40 weeks, and a baby that is born at or before 33 weeks is called “premature”. The number of weeks at which a baby is born is called its “gestational age”. Premature babies are usually smaller than normal and may require special care. It is a good sign if, when the mother and baby leave the hospital to go home, the baby is successfully breastfeeding.

The data in link are from a study of 64 premature infants. There are three columns: the gestational age (a whole number of weeks), the number of babies of that gestational age that were successfully breastfeeding when they left the hospital, and the number that were not. (There were multiple babies of the same gestational age, so the 64 babies are summarized in 6 rows of data.)

Read the data into R and display the data frame.

Verify that there were indeed 64 infants, by having R do a suitable calculation on your data frame that gives the right answer for the right reason.

Do you think, looking at the data, that there is a relationship between gestational age and whether or not the baby was successfully breastfeeding when it left the hospital? Explain briefly.

Why is logistic regression a sensible technique to use here? Explain briefly.

Fit a logistic regression to predict the probability that an infant will be breastfeeding from its gestational age. Show the output from your logistic regression.

Does the significance or non-significance of the slope of

`gest.age`

surprise you? Explain briefly.Is your slope (in the

`Estimate`

column) positive or negative? What does that mean, in terms of gestational ages and breastfeeding? Explain briefly.Obtain the predicted probabilities that an infant will successfully breastfeed for a representative collection of gestational ages.

## 27.8 Making it over the mountains

In 1846, the Donner party (Donner and Reed families) left Springfield, Illinois for California in covered wagons. After reaching Fort Bridger, Wyoming, the leaders decided to find a new route to Sacramento. They became stranded in the eastern Sierra Nevada mountains at a place now called Donner Pass, when the region was hit by heavy snows in late October. By the time the survivors were rescued on April 21, 1847, 40 out of 87 had died.

After the rescue, the age and gender of each person in the party was recorded, along with whether they survived or not. The data are in link.

Read in the data and display its structure. Does that agree with the description in the previous paragraph?

Make graphical or numerical summaries for each pair of variables. That is, you should make a graph or numerical summary for each of

`age`

vs.`gender`

,`age`

vs.

`survived`

and`gender`

vs.`survived`

. In choosing the kind of graph or summary that you will use, bear in mind that`survived`

and`gender`

are factors with two levels.For each of the three graphs or summaries in the previous question, what do they tell you about the relationship between the pair of variables concerned? Explain briefly.

Fit a logistic regression predicting survival from age and gender. Display the summary.

Do both explanatory variables have an impact on survival? Does that seem to be consistent with your numerical or graphical summaries? Explain briefly.

Are the men typically older, younger or about the same age as the women? Considering this, explain carefully what the negative

`gendermale`

slope in your logistic regression means.Obtain predicted probabilities of survival for each combination of some representative ages and of the two genders in this dataset.

Do your predictions support your conclusions from earlier about the effects of age and gender? Explain briefly.

## 27.9 Who needs the most intensive care?

The “APACHE II” is a scale for assessing patients who arrive in the intensive care unit (ICU) of a hospital. These are seriously ill patients who may die despite the ICU’s best attempts. APACHE stands for “Acute Physiology And Chronic Health Evaluation”.^{2} The scale score is calculated from several physiological measurements such as body temperature, heart rate and the Glasgow coma scale, as well as the patient’s age. The final result is a score between 0 and 71, with a higher score indicating more severe health issues. Is it true that a patient with a higher APACHE II score has a higher probability of dying?

Data from one hospital are in link. The columns are: the APACHE II score, the total number of patients who had that score, and the number of patients with that score who died.

Read in and display the data (however much of it displays). Why are you convinced that have the right thing?

Does each row of the data frame relate to one patient or sometimes to more than one? Explain briefly.

Explain why this is the kind of situation where you need a two-column response, and create this response variable, bearing in mind that I will (later) want you to estimate the probability of dying, given the

`apache`

score.Fit a logistic regression to estimate the probability of death from the

`apache`

score, and display the results.Is there a significant effect of

`apache`

score on the probability of survival? Explain briefly.Is the effect of a larger

`apache`

score to increase or to decrease the probability of death? Explain briefly.Obtain the predicted probability of death for a representative collection of the

`apache`

scores that were in the data set. Do your predictions behave as you would expect (from your earlier work)?Make a plot of predicted death probability against

`apache`

score (joined by lines) with, also on the plot, the observed proportion of deaths within each`apache`

score, plotted against`apache`

score. Does there seem to be good agreement between observation and prediction? Hint: calculate what you need to from the original dataframe first, save it, then make a plot of the predictions, and then to the plot add the appropriate thing from the dataframe you just saved.

## 27.10 Go away and don’t come back!

When a person has a heart attack and survives it, the major concern of health professionals is to prevent the person having a second heart attack. Two factors that are believed to be important are anger and anxiety; if a heart attack survivor tends to be angry or anxious, they are believed to put themselves at increased risk of a second heart attack.

Twenty heart attack survivors took part in a study. Each one was given a test designed to assess their anxiety (a higher score on the test indicates higher anxiety), and some of the survivors took an anger management course. The data are in link; `y`

and `n`

denote “yes” and “no” respectively. The columns denote (i) whether or not the person had a second heart attack, (ii) whether or not the person took the anger management class, (iii) the anxiety score.

Read in and display the data.

* Fit a logistic regression predicting whether or not a heart attack survivor has a second heart attack, as it depends on anxiety score and whether or not the person took the anger management class. Display the results.

In the previous part, how can you tell that you were predicting the probability of having a second heart attack (as opposed to the probability of not having one)?

* For the two possible values

`y`

and`n`

of`anger`

and the anxiety scores 40, 50 and 60, make a data frame containing all six combinations, and use it to obtain predicted probabilities of a second heart attack. Display your predicted probabilities side by side with what they are predictions for.Use your predictions from the previous part to describe the effect of changes in

`anxiety`

and`anger`

on the probability of a second heart attack.Are the effects you described in the previous part consistent with the

`summary`

output from`glm`

that you obtained in (here)? Explain briefly how they are, or are not. (You need an explanation for each of`anxiety`

and`anger`

, and you will probably get confused if you look at the P-values, so don’t.)

My solutions follow:

## 27.11 Finding wolf spiders on the beach

A team of Japanese researchers were investigating what would cause the burrowing wolf spider *Lycosa ishikariana* to be found on a beach. They hypothesized that it had to do with the size of the grains of sand on the beach. They went to 28 beaches in Japan, measured the average sand grain size (in millimetres), and observed the presence or absence of this particular spider on each beach. The data are in link.

- Why would logistic regression be better than regular regression in this case?

Solution

Because the response variable, whether or not the spider is present, is a categorical yes/no success/failure kind of variable rather than a quantitative numerical one, and when you have this kind of response variable, this is when you want to use logistic regression.

\(\blacksquare\)

- Read in the data and check that you have something sensible. (Look at the data file first: the columns are aligned.)

Solution

The nature of the file means that you need `read_table`

:

```
<- "http://ritsokiguess.site/datafiles/spiders.txt"
my_url <- read_table(my_url) spider
```

```
── Column specification ────────────────────────────────────────────────────────
cols(
Grain.size = col_double(),
Spiders = col_character()
)
```

` spider`

We have a numerical sand grain size and a categorical variable `Spiders`

that indicates whether the spider was present or absent. As we were expecting. (The categorical variable is actually text or `chr`

, which will matter in a minute.)

\(\blacksquare\)

- Make a boxplot of sand grain size according to whether the spider is present or absent. Does this suggest that sand grain size has something to do with presence or absence of the spider?

Solution

`ggplot(spider, aes(x = Spiders, y = Grain.size)) + geom_boxplot()`

The story seems to be that when spiders are present, the sand grain size tends to be larger. So we would expect to find some kind of useful relationship in the logistic regression.

Note that we have reversed the cause and effect here: in the boxplot we are asking “given that the spider is present or absent, how big are the grains of sand?”, whereas the logistic regression is asking “given the size of the grains of sand, how likely is the spider to be present?”. But if one variable has to do with the other, we would hope to see the link either way around.

\(\blacksquare\)

- Fit a logistic regression predicting the presence or absence of spiders from the grain size, and obtain its
`summary`

. Note that you will need to do something with the response variable first.

Solution

The presence or absence of spiders is our response. This is actually text at the moment:

` spider`

so we need to make a factor version of it first. I’m going to live on the edge and overwrite everything:

```
<- spider %>% mutate(Spiders = factor(Spiders))
spider spider
```

`Spiders`

is now a factor, correctly. (Sometimes a text variable gets treated as a factor, sometimes it needs to be an explicit `factor`

. This is one of those times.) Now we go ahead and fit the model. I’m naming this as response-with-a-number, so I still have the Capital Letter. Any choice of name is good, though.

```
.1 <- glm(Spiders ~ Grain.size, family = "binomial", data = spider)
Spiderssummary(Spiders.1)
```

```
Call:
glm(formula = Spiders ~ Grain.size, family = "binomial", data = spider)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.648 1.354 -1.217 0.2237
Grain.size 5.122 3.006 1.704 0.0884 .
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 35.165 on 27 degrees of freedom
Residual deviance: 30.632 on 26 degrees of freedom
AIC: 34.632
Number of Fisher Scoring iterations: 5
```

\(\blacksquare\)

- Is there a significant relationship between grain size and presence or absence of spiders at the \(\alpha=0.10\) level? Explain briefly.

Solution

The P-value on the `Grain.size`

line is *just* less than 0.10 (it is 0.088), so there is *just* a significant relationship. It isn’t a very strong significance, though. This might be because we don’t have that much data: even though we have 28 observations, which, on the face of it, is not a very small sample size, each observation doesn’t tell us much: only whether the spider was found on that beach or not. Typical sample sizes in logistic regression are in the hundreds — the same as for opinion polls, because you’re dealing with the same kind of data. The weak significance here lends some kind of weak support to the researchers’ hypothesis, but I’m sure they were hoping for something better.

\(\blacksquare\)

- Obtain predicted probabilities of spider presence for a representative collection of grain sizes. I only want predicted probabilities, not any kind of intervals.

Solution

Make a data frame of values to predict from directly, using `tribble`

or for that matter `datagrid`

. For some reason, I chose these four values:

```
<- tribble(
new ~Grain.size,
0.2,
0.5,
0.8,
1.1
) new
```

and then

`cbind(predictions(Spiders.1, newdata = new))`

One of the above is all I need. I prefer the first one, since that way we don’t even have to decide what th e representative values are.

Extra:

Note that the probabilities don’t go up linearly. (If they did, they would soon get bigger than 1!). It’s actually the *log-odds* that go up linearly.

To verify this, you can add a `type`

to the `predictions`

:

`cbind(predictions(Spiders.1, newdata = new, type = "link"))`

This one, as you see shortly, makes more sense with equally-spaced grain sizes.

The meaning of that slope coefficient in the `summary`

, which is about 5, is that if you increase grain size by 1, you increase the log-odds by 5. In the table above, we increased the grain size by 0.3 each time, so we would expect to increase the log-odds by \((0.3)(5)=1.5\), which is (to this accuracy) what happened.

Log-odds are hard to interpret. Odds are a bit easier, and to get them, we have to `exp`

the log-odds:

```
cbind(predictions(Spiders.1, newdata = new, type = "link")) %>%
mutate(exp_pred = exp(estimate))
```

Thus, with each step of 0.3 in grain size, we *multiply* the odds of finding a spider by about

`exp(1.5)`

`[1] 4.481689`

or about 4.5 times. If you’re a gambler, this might give you a feel for how large the effect of grain size is. Or, of course, you can just look at the probabilities.

\(\blacksquare\)

- Given your previous work in this question, does the trend you see in your predicted probabilities surprise you? Explain briefly.

Solution

My predicted probabilities go up as grain size goes up. This fails to surprise me for a couple of reasons: first, in my boxplots, I saw that grain size tended to be larger when spiders were present, and second, in my logistic regression `summary`

, the slope was positive, so likewise spiders should be more likely as grain size goes up. Observing just one of these things is enough, though of course it’s nice if you can spot both.

\(\blacksquare\)

## 27.12 Killing aphids

An experiment was designed to examine how well the insecticide rotenone kills aphids that feed on the chrysanthemum plant called *Macrosiphoniella sanborni*. The explanatory variable is the log concentration (in milligrams per litre) of the insecticide. At each of the five different concentrations, approximately 50 insects were exposed. The number of insects exposed at each concentration, and the number killed, are shown below.

```
Log-Concentration Insects exposed Number killed
0.96 50 6
1.33 48 16
1.63 46 24
2.04 49 42
2.32 50 44
```

- Get these data into R. You can do this by copying the data into a file and reading that into R (creating a data frame), or you can enter the data manually into R using
`c`

, since there are not many values. In the latter case, you can create a data frame or not, as you wish. Demonstrate that you have the right data in R.

Solution

There are a couple of ways. My current favourite is the `tidyverse`

-approved `tribble`

method. A `tribble`

is a “transposed `tibble`

”, in which you copy and paste the data, inserting column headings and commas in the right places. The columns don’t have to line up, since it’s the commas that determine where one value ends and the next one begins:

```
<- tribble(
dead_bugs ~log_conc, ~exposed, ~killed,
0.96, 50, 6,
1.33, 48, 16,
1.63, 46, 24,
2.04, 49, 42,
2.32, 50, 44
) dead_bugs
```

Note that the last data value has no comma after it, but instead has the closing bracket of `tribble`

.

You can have extra spaces if you wish. They will just be ignored. If you are clever in R Studio, you can insert a column of commas all at once (using “multiple cursors”). I used to do it like this. I make vectors of each column using `c`

and then glue the columns together into a data frame:

```
<- c(0.96, 1.33, 1.63, 2.04, 2.32)
log_conc <- c(50, 48, 46, 49, 50)
exposed <- c(6, 16, 24, 42, 44)
killed <- tibble(log_conc, exposed, killed)
dead_bugs2 dead_bugs2
```

The values are correct — I checked them.

Now you see why `tribble`

stands for “transposed tibble”: if you want to construct a data frame by hand, you have to work with columns and then glue them together, but `tribble`

allows you to work “row-wise” with the data as you would lay it out on the page.

The other obvious way to read the data values without typing them is to copy them into a file and read *that*. The values as laid out are aligned in columns. They might be separated by tabs, but they are not. (It’s hard to tell without investigating, though a tab is by default eight spaces and these values look farther apart than that.) I copied them into a file `exposed.txt`

in my current folder (or use `file.choose`

):

`<- read_table("exposed.txt") bugs2 `

`Warning: Missing column names filled in: 'X6' [6]`

```
── Column specification ────────────────────────────────────────────────────────
cols(
`Log-Concentration` = col_double(),
Insects = col_double(),
exposed = col_double(),
Number = col_logical(),
killed = col_character(),
X6 = col_character()
)
```

```
Warning: 5 parsing failures.
row col expected actual file
1 -- 6 columns 4 columns 'exposed.txt'
2 -- 6 columns 4 columns 'exposed.txt'
3 -- 6 columns 4 columns 'exposed.txt'
4 -- 6 columns 4 columns 'exposed.txt'
5 -- 6 columns 4 columns 'exposed.txt'
```

` bugs2`

This didn’t quite work: the last column `Number killed`

got split into two, with the actual number killed landing up in `Number`

and the column `killed`

being empty. If you look at the data file, the data values for `Number killed`

are actually aligned with the word `Number`

, which is why it came out this way. Also, you’ll note, the column names have those “backticks” around them, because they contain illegal characters like a minus sign and spaces. Perhaps a good way to pre-empt^{3} all these problems is to make a copy of the data file with the illegal characters replaced by underscores, which is my file `exposed2.txt`

:

`<- read_table("exposed2.txt") bugs2 `

```
── Column specification ────────────────────────────────────────────────────────
cols(
Log_Concentration = col_double(),
Insects_exposed = col_double(),
Number_killed = col_double()
)
```

` bugs2`

I don’t know why a ficitious fourth column was invented, but we can safely ignore it the rest of the way. We have all our actual data. We’d have to be careful with Capital Letters this way, but it’s definitely good.

You may have thought that this was a lot of fuss to make about reading in data, but the point is that data can come your way in lots of different forms, and you need to be able to handle whatever you receive so that you can do some analysis with it.

\(\blacksquare\)

- * Looking at the data, would you expect there to be a significant effect of log-concentration? Explain briefly.

Solution

The numbers of insects killed goes up *sharply* as the concentration increases, while the numbers of insects exposed don’t change much. So I would expect to see a strong, positive effect of concentration, and I would expect it to be strongly significant, especially since we have almost 250 insects altogether.

\(\blacksquare\)

- We are going to do a logistic regression to predict how likely an insect is to be killed, as it depends on the log-concentration. Create a suitable response variable, bearing in mind (i) that we have lots of insects exposed to each different concentration, and (ii) what needs to go into each column of the response.

Solution

There needs to be a two-column response variable. The first column needs to be the number of “successes” (insects killed, here) and the second needs to be the number of “failures” (insects that survived). We don’t actually have the latter, but we know how many insects were exposed in total to each dose, so we can work it out. Like this:

```
%>%
dead_bugs mutate(survived = exposed - killed) %>%
select(killed, survived) %>%
as.matrix() -> response
response
```

```
killed survived
[1,] 6 44
[2,] 16 32
[3,] 24 22
[4,] 42 7
[5,] 44 6
```

`glm`

requires an R `matrix`

rather than a data frame, so the last stage of our pipeline is to create one (using the same numbers that are in the data frame: all the `as.`

functions do is to change what type of thing it is, without changing its contents).

It’s also equally good to create the response *outside* of the data frame and use `cbind`

to glue its columns together:

```
<- with(
resp2
dead_bugs,cbind(killed, survived = exposed - killed)
) resp2
```

```
killed survived
[1,] 6 44
[2,] 16 32
[3,] 24 22
[4,] 42 7
[5,] 44 6
```

\(\blacksquare\)

- Run a suitable logistic regression, and obtain a summary of the results.

Solution

I think you know how to do this by now:

```
.1 <- glm(response ~ log_conc, family = "binomial", data = dead_bugs)
bugssummary(bugs.1)
```

```
Call:
glm(formula = response ~ log_conc, family = "binomial", data = dead_bugs)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -4.8923 0.6426 -7.613 2.67e-14 ***
log_conc 3.1088 0.3879 8.015 1.11e-15 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 96.6881 on 4 degrees of freedom
Residual deviance: 1.4542 on 3 degrees of freedom
AIC: 24.675
Number of Fisher Scoring iterations: 4
```

\(\blacksquare\)

- Does your analysis support your answer to (here)? Explain briefly.

Solution

That’s a *very* small P-value, \(1.1\times 10^{-15}\), on `log_conc`

, so there is no doubt that concentration has an effect on an insect’s chances of being killed. This is exactly what I guessed in (here), which I did before looking at the results, honest!

\(\blacksquare\)

- Obtain predicted probabilities of an insect’s being killed at each of the log-concentrations in the data set.

Solution

Use a `newdata`

that is the original dataframe:

`cbind(predictions(bugs.1, newdata = dead_bugs))`

The advantage of this is that you can compare the observed with the predicted. For example, 44 out of 50 insects were killed at log-dose 2.32, which is a proportion of 0.88, pretty close to the prediction of 0.91.

Extra: you could also make a plot of these. The normal thing is to use `plot_cap`

, but this time, the response variable is that matrix thing outside the dataframe, which confuses the issue. So let’s make a more detailed set of predictions and plot them ourselves (effectively, doing the same thing `plot_cap`

does but building it ourselves):

```
<- tibble(log_conc = seq(0.5, 2.5, 0.01))
new cbind(predictions(bugs.1, newdata = new)) %>%
select(log_conc, estimate, conf.low, conf.high) %>%
ggplot(aes(x = log_conc, y = estimate, ymin = conf.low, ymax = conf.high)) +
geom_line() + geom_ribbon(alpha = 0.3)
```

As the log-concentration goes up, you can see how clearly the probability of the aphid being killed goes up. The confidence band is narrow because there is lots of data (almost 250 aphids altogether).

The final `alpha = 0.3`

makes the ribbon see-through, so that you can see the actual predictions behind it. A smaller value of `alpha`

makes it more transparent, but I think this strikes a decent compromise between being able to clearly see both the predictions and the confidence limits.

\(\blacksquare\)

- People in this kind of work are often interested in the “median lethal dose”. In this case, that would be the log-concentration of the insecticide that kills half the insects. Based on your predictions, roughly what do you think the median lethal dose is?

Solution

The log-concentration of 1.63 is predicted to kill just over half the insects, so the median lethal dose should be a bit less than 1.63. It should not be as small as 1.33, though, since that log-concentration only kills less than a third of the insects. So I would guess somewhere a bit bigger than 1.5. Any guess somewhere in this ballpark is fine: you really cannot be very accurate.

If you read through the Extra to the previous part (or at least looked at the graph), the median lethal dose is where the curve of predictions passes through 0.5 on the \(y\)-axis. This is at a log-concentration of just less than 1.6; if you judge from the scale where the crossing-point is, it’s something like 1.57.

Extra: this is kind of a strange prediction problem, because we know what the *response* variable should be, and we want to know what the explanatory variable’s value is. Normally we do predictions the other way around.^{4} So the only way to get a more accurate figure is to try some different log-concentrations, and see which one gets closest to a probability 0.5 of killing the insect.

Something like this would work:

```
<- datagrid(model = bugs.1, log_conc = seq(1.5, 1.63, 0.01))
new cbind(predictions(bugs.1, newdata = new))
```

The closest one of these to a probability of 0.5 is 0.4971, which goes with a log-concentration of 1.57: indeed, a bit bigger than 1.5 and a bit less than 1.63, and close to what I read off from my graph. The `seq`

in the construction of the new data frame is “fill sequence”: go from 1.5 to 1.63 in steps of 0.01. We are predicting for values of `log_conc`

that we chose, so the way to go is to make a new dataframe with `datagrid`

and then feed that into `predictions`

with `newdata`

.

Now, of course this is only our “best guess”, like a single-number prediction in regression. There is uncertainty attached to it (because the actual logistic regression might be different from the one we estimated), so we ought to provide a confidence interval for it. But I’m riding the bus as I type this, so I can’t look it up right now.

Later: there’s a function called `dose.p`

in `MASS`

that appears to do this:

```
<- dose.p(bugs.1)
lethal lethal
```

```
Dose SE
p = 0.5: 1.573717 0.05159576
```

We have a sensible point estimate (the same 1.57 that we got by hand), and we have a standard error, so we can make a confidence interval by going up and down twice it (or 1.96 times it) from the estimate. The structure of the result is a bit arcane, though:

`str(lethal)`

```
'glm.dose' Named num 1.57
- attr(*, "names")= chr "p = 0.5:"
- attr(*, "SE")= num [1, 1] 0.0516
..- attr(*, "dimnames")=List of 2
.. ..$ : chr "p = 0.5:"
.. ..$ : NULL
- attr(*, "p")= num 0.5
```

It is what R calls a “vector with attributes”. To get at the pieces and calculate the interval, we have to do something like this:

`<- as.numeric(lethal)) (lethal_est `

`[1] 1.573717`

`<- as.vector(attr(lethal, "SE"))) (lethal_SE `

`[1] 0.05159576`

and then make the interval:

`+ c(-2, 2) * lethal_SE lethal_est `

`[1] 1.470526 1.676909`

1.47 to 1.68.

I got this idea from page 4.14 of link. I think I got a little further than he did. An idea that works more generally is to get several intervals all at once, say for the “quartile lethal doses” as well:

```
<- dose.p(bugs.1, p = c(0.25, 0.5, 0.75))
lethal lethal
```

```
Dose SE
p = 0.25: 1.220327 0.07032465
p = 0.50: 1.573717 0.05159576
p = 0.75: 1.927108 0.06532356
```

This looks like a data frame or matrix, but is actually a “named vector”, so `enframe`

will get at least some of this and turn it into a genuine data frame:

`enframe(lethal)`

That doesn’t get the SEs, so we’ll make a new column by grabbing the “attribute” as above:

`enframe(lethal) %>% mutate(SE = attr(lethal, "SE"))`

and now we make the intervals by making new columns containing the lower and upper limits:

```
enframe(lethal) %>%
mutate(SE = attr(lethal, "SE")) %>%
mutate(LCL = value - 2 * SE, UCL = value + 2 * SE)
```

Now we have intervals for the median lethal dose, as well as for the doses that kill a quarter and three quarters of the aphids.

\(\blacksquare\)

## 27.13 The effects of Substance A

In a dose-response experiment, animals (or cell cultures or human subjects) are exposed to some toxic substance, and we observe how many of them show some sort of response. In this experiment, a mysterious Substance A is exposed at various doses to 100 cells at each dose, and the number of cells at each dose that suffer damage is recorded. The doses were 10, 20, … 70 (mg), and the number of damaged cells out of 100 were respectively 10, 28, 53, 77, 91, 98, 99.

- Find a way to get these data into R, and show that you have managed to do so successfully.

Solution

There’s not much data here, so we don’t need to create a file, although you can do so if you like (in the obvious way: type the doses and damaged cell numbers into a `.txt`

file or spreadsheet and read in with the appropriate `read_`

function). Or, use a `tribble`

:

```
<- tribble(
dr ~dose, ~damaged,
10, 10,
20, 28,
30, 53,
40, 77,
50, 91,
60, 98,
70, 99
) dr
```

Or, make a data frame with the values typed in:

```
<- tibble(
dr2 dose = seq(10, 70, 10),
damaged = c(10, 28, 53, 77, 91, 98, 99)
) dr2
```

`seq`

fills a sequence “10 to 70 in steps of 10”, or you can just list the doses.

I like this better than making two columns *not* attached to a data frame, but that will work as well.

For these, find a way you like, and stick with it.

\(\blacksquare\)

- Would you expect to see a significant effect of dose for these data? Explain briefly.

Solution

The number of damaged cells goes up sharply as the dose goes up (from a very small number to almost all of them). So I’d expect to see a strongly significant effect of dose. This is far from something that would happen by chance.

\(\blacksquare\)

- Fit a logistic regression modelling the probability of a cell being damaged as it depends on dose. Display the results. (Comment on them is coming later.)

Solution

This has a bunch of observations at each dose (100 cells, in fact), so we need to do the two-column response thing: the successes in the first column and the failures in the second. This means that we first need to calculate the number of cells at each dose that were not damaged, by subtracting the number that *were* damaged from 100. R makes this easier than you’d think, as you see. A `mutate`

is the way to go: create a new column in `dr`

and save back in `dr`

(because I like living on the edge):

```
<- dr %>% mutate(undamaged = 100 - damaged)
dr dr
```

The programmer in you is probably complaining “but, 100 is a number and `damaged`

is a vector of 7 numbers. How does R know to subtract *each one* from 100?” Well, R has what’s known as “recycling rules”: if you try to add or subtract (or elementwise multiply or divide) two vectors of different lengths, it recycles the smaller one by repeating it until it’s as long as the longer one. So rather than `100-damaged`

giving an error, it does what you want.^{5}

I took the risk of saving the new data frame over the old one. If it had failed for some reason, I could have started again.

Now we have to make our response “matrix” with two columns, using `cbind`

:

```
<- with(dr, cbind(damaged, undamaged))
response response
```

```
damaged undamaged
[1,] 10 90
[2,] 28 72
[3,] 53 47
[4,] 77 23
[5,] 91 9
[6,] 98 2
[7,] 99 1
```

Note that each row adds up to 100, since there were 100 cells at each dose. This is actually trickier than it looks: the two things in `cbind`

are columns (vectors), and `cbind`

glues them together to make an R `matrix`

:

`class(response)`

`[1] "matrix" "array" `

which is what `glm`

needs here, even though it looks a lot like a data frame (which wouldn’t work here). This *would* be a data frame:

`%>% select(damaged, undamaged) %>% class() dr `

`[1] "tbl_df" "tbl" "data.frame"`

and would therefore be the wrong thing to give `glm`

. So I had to do it with `cbind`

, or use some other trickery, like this:

```
%>%
dr select(damaged, undamaged) %>%
as.matrix() -> resp
class(resp)
```

`[1] "matrix" "array" `

Now we fit our model:

```
.1 <- glm(response ~ dose, family = "binomial", data = dr)
cellssummary(cells.1)
```

```
Call:
glm(formula = response ~ dose, family = "binomial", data = dr)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -3.275364 0.278479 -11.76 <2e-16 ***
dose 0.113323 0.008315 13.63 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 384.13349 on 6 degrees of freedom
Residual deviance: 0.50428 on 5 degrees of freedom
AIC: 31.725
Number of Fisher Scoring iterations: 4
```

\(\blacksquare\)

- Does your output indicate that the probability of damage really does
*increase*with dose? (There are two things here: is there really an effect, and which way does it go?)

Solution

The slope of `dose`

is significantly nonzero (P-value less than \(2.2 \times 10^{-16}\), which is as small as it can be). The slope itself is *positive*, which means that as dose goes up, the probability of damage goes up. That’s all I needed, but if you want to press on: the slope is 0.113, so an increase of 1 in dose goes with an increase of 0.113 in the *log-odds* of damage. Or it multiplies the odds of damage by \(\exp(0.113)\). Since 0.113 is small, this is about \(1.113\) (mathematically, \(e^x\simeq 1+x\) if \(x\) is small), so that the increase is about 11%. If you like, you can get a rough CI for the slope by going up and down twice its standard error (this is the usual approximately-normal thing). Here that would be

`0.113323 + c(-2, 2) * 0.008315`

`[1] 0.096693 0.129953`

I thought about doing that in my head, but thought better of it, since I have R just sitting here. The interval is short (we have lots of data) and definitely does not contain zero.

\(\blacksquare\)

- Obtain predicted damage probabilities for some representative doses.

Solution

Pick some representative doses first, and then use them as `newdata`

. The ones in the original data are fine (there are only seven of them):

```
<- cbind(predictions(cells.1, newdata = dr))
p p
```

I saved mine to use again later, but you don’t have to unless *you* want to use your predictions again later.

\(\blacksquare\)

- Draw a graph of the predicted probabilities, and to that add the observed proportions of damage at each dose. Hints: you will have to calculate the observed proportions first. See here, near the bottom, to find out how to add data to one of these graphs. The
`geom_point`

line is the one you need.

Looking at the predicted probabilities, would you say that the model fits well? Explain briefly.

Solution

This ought to be based on `plot_cap`

, but that doesn’t work here because of the `response`

that’s not part of the dataframe. So we will be making this ourselves. Let’s start with a plot of the predictions, using the prediction we did just now:

```
ggplot(p, aes(x = dose, y = estimate, ymin = conf.low, ymax = conf.high)) +
geom_line() + geom_ribbon(alpha = 0.3)
```

This is not a smooth curve like the ones `plot_cap`

makes, but that’s all right, because we want to compare the predictions with the data.

Let’s take a look at our original dataframe:

` dr`

To that we need to add a column of proportion damaged, which is `damaged`

divided by `damaged`

plus `undamaged`

. This last ought to be 100, but data can go missing for any number of reasons, so it pays not to assume that they add up to 100 every time:

```
%>% mutate(prop = damaged / (damaged + undamaged)) -> dr2
dr dr2
```

Check. I saved this to add to the graph later.

Now you can add a `geom_point`

with a `data =`

and an `aes`

, making the points red, except that the obvious doesn’t quite work:

```
ggplot(p, aes(x = dose, y = estimate, ymin = conf.low, ymax = conf.high)) +
geom_line() + geom_ribbon(alpha = 0.3) +
geom_point(data = dr2, aes(x = dose, y = prop), colour = "red")
```

```
Error in `geom_point()`:
! Problem while computing aesthetics.
ℹ Error occurred in the 3rd layer.
Caused by error:
! object 'conf.low' not found
```

The message is not very helpful, but I can tell you where it comes from. When you add something to a plot like this, all the things in the original `ggplot`

are “inherited” by anything else that you add to the plot, so that you either have to overwrite them with something new (as I did with `x`

and `y`

) or you get the previous values, one of which was evidently `conf.low`

(which supplied a value for `ymin`

), but the dataframe `dr2`

doesn’t have a `conf.low`

in it. To override this behaviour, which we don’t want because *we* have nothing called `conf.low`

in our data, add `inherit.aes = FALSE`

to the `geom_point`

:^{6}

```
ggplot(p, aes(x = dose, y = estimate, ymin = conf.low, ymax = conf.high)) +
geom_line() + geom_ribbon(alpha = 0.3) +
geom_point(data = dr2, aes(x = dose, y = prop), colour = "red", inherit.aes = FALSE)
```

I also made the data points red (you don’t need to, but if you want to, put the colour *outside* the `aes`

, to make it clear that the colour is constant, not determined by any of the variables in your dataframe).

The predicted probabilities ought to be close to the observed proportions. They are in fact *very* close to them, so the model fits very well indeed.

Your actual words are a judgement call, so precisely what you say doesn’t matter so much, but *I* think this model fit is actually closer than you could even hope to expect, it’s that good. But, your call. I think your answer ought to contain “close” or “fits well” at the very least.

\(\blacksquare\)

## 27.14 What makes an animal get infected?

Some animals got infected with a parasite. We are interested in whether the likelihood of infection depends on any of the age, weight and sex of the animals. The data are at link. The values are separated by tabs.

- Read in the data and take a look at the first few lines. Is this one animal per line, or are several animals with the same age, weight and sex (and infection status) combined onto one line? How can you tell?

Solution

The usual beginnings, bearing in mind the data layout:

```
<- "http://ritsokiguess.site/datafiles/infection.txt"
my_url <- read_tsv(my_url) infect
```

```
Rows: 81 Columns: 4
── Column specification ────────────────────────────────────────────────────────
Delimiter: "\t"
chr (2): infected, sex
dbl (2): age, weight
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
```

` infect`

Success. This appears to be one animal per line, since there is no indication of frequency (of “how many”). If you were working as a consultant with somebody’s data, this would be a good thing to confirm with them before you went any further.

You can check a few more lines to convince yourself and the story is much the same. The other hint is that you have actual categories of response, which usually indicates one individual per row, but not always. If it doesn’t, you have some extra work to do to bash it into the right format.

Extra: let’s see whether we can come up with an example of that. I’ll make a smaller example, and perhaps the place to start is “all possible combinations” of a few things. `crossing`

is the same idea as `datagrid`

, except that it doesn’t need a model, and so is better for building things from scratch:

```
<- crossing(age = c(10, 12), gender = c("f", "m"), infected = c("y", "n"))
d d
```

These might be one individual per row, or they might be more than one, as they would be if we have a column of frequencies:

```
<- d %>% mutate(freq = c(12, 19, 17, 11, 18, 26, 16, 8))
d d
```

Now, these are multiple observations per row (the presence of frequencies means there’s no doubt about that), but the format is wrong: `infected`

is my response variable, and we want the frequencies of `infected`

being `y`

or `n`

in *separate* columns — that is, we have to *untidy* the data a bit to make it suitable for modelling. This is `pivot_wider`

, the opposite of `pivot_longer`

:

`%>% pivot_wider(names_from=infected, values_from=freq) d `

Now you can pull out the columns `y`

and `n`

and make them into your response, and predict that from `age`

and `gender`

.

The moral of this story is that if you are going to have multiple observations per row, you probably want the combinations of *explanatory* variables one per row, but you want the categories of the *response* variable in separate columns.

Back to where we were the rest of the way.

\(\blacksquare\)

- * Make suitable plots or summaries of
`infected`

against each of the other variables. (You’ll have to think about`sex`

, um, you’ll have to think about the`sex`

variable, because it too is categorical.) Anything sensible is OK here. You might like to think back to what we did in Question here for inspiration. (You can also investigate`table`

, which does cross-tabulations.)

Solution

What comes to my mind for the numerical variables `age`

and `weight`

is boxplots:

`ggplot(infect, aes(x = infected, y = age)) + geom_boxplot()`

`ggplot(infect, aes(x = infected, y = weight)) + geom_boxplot()`

The variables `sex`

and `infected`

are both categorical. I guess a good plot for those would be some kind of grouped bar plot, which I have to think about. So let’s first try a numerical summary, a cross-tabulation, which is gotten via `table`

:

`with(infect, table(sex, infected))`

```
infected
sex absent present
female 17 11
male 47 6
```

Or, if you like the `tidyverse`

:

`%>% count(sex, infected) infect `

Now, bar plots. Let’s start with one variable. The basic bar plot has categories of a categorical variable along the \(x\)-axis and each bar is a count of how many observations were in that category. What is nice about `geom_bar`

is that it will do the counting for you, so that the plot is just this:

`ggplot(infect, aes(x = sex)) + geom_bar()`

There are about twice as many males as females.

You may think that this looks like a histogram, which it almost does, but there is an important difference: the kind of variable on the \(x\)-axis. Here, it is a categorical variable, and you count how many observations fall in each category (at least, `ggplot`

does). On a histogram, the \(x\)-axis variable is a continuous *numerical* one, like height or weight, and you have to *chop it up* into intervals (and then you count how many observations are in each chopped-up interval).

Technically, on a bar plot, the bars have a little gap between them (as here), whereas the histogram bars are right next to each other, because the right side of one histogram bar is the left side of the next.

All right, two categorical variables. The idea is that you have each bar divided into sub-bars based on the frequencies of a second variable, which is specified by `fill`

. Here’s the basic idea:

`ggplot(infect, aes(x = sex, fill = infected)) + geom_bar()`

This is known in the business as a “stacked bar chart”. The issue is how much of each bar is blue, which is unnecessarily hard to judge because the male bar is taller. (Here, it is not so bad, because the amount of blue in the male bar is smaller and the bar is also taller. But we got lucky here.)

There are two ways to improve this. One is known as a “grouped bar chart”, which goes like this:

```
ggplot(infect, aes(x = sex, fill = infected)) +
geom_bar(position = "dodge")
```

The absent and present frequencies for females are next to each other, and the same for males, and you can read off how big they are from the \(y\)-scale. This is my preferred graph for two (or more than two) categorical variables.

You could switch the roles of `sex`

and `infected`

and get a different chart, but one that conveys the same information. Try it. (The reason for doing it the way around I did is that being infected or not is the response and `sex`

is explanatory, so that on my plot you can ask “out of the males, how many were infected?”, which is the way around that makes sense.)

The second way is to go back to stacked bars, but make them the same height, so you can compare the fractions of the bars that are each colour. This is `position="fill"`

:

```
ggplot(infect, aes(x = sex, fill = infected)) +
geom_bar(position = "fill")
```

This also shows that more of the females were infected than the males, but without getting sidetracked into the issue that there were more males to begin with.

I wrote this question in early 2017. At that time, I wrote:

I learned about this one approximately two hours ago. I just ordered Hadley Wickham’s new book “R for Data Science” from Amazon, and it arrived today. It’s in there. (A good read, by the way. I’m thinking of using it as a recommended text next year.) As is so often the way with

`ggplot`

, the final answer looks very simple, but there is a lot of thinking required to get there, and you end up having even more respect for Hadley Wickham for the clarity of thinking that enabled this to be specified in a simple way.

(end quote)

\(\blacksquare\)

- Which, if any, of your explanatory variables appear to be related to
`infected`

? Explain briefly.

Solution

Let’s go through our output from (here). In terms of `age`

, when infection is present, animals are (slightly) older. So there might be a small age effect. Next, when infection is present, weight is typically a *lot* less. So there ought to be a big weight effect. Finally, from the table, females are somewhere around 50-50 infected or not, but very few males are infected. So there ought to be a big `sex`

effect as well. This also appears in the grouped bar plot, where the red (“absent”) bar for males is much taller than the blue (“present”) bar, but for females the two bars are almost the same height. So the story is that we would expect a significant effect of `sex`

and `weight`

, and maybe of `age`

as well.

\(\blacksquare\)

- Fit a logistic regression predicting
`infected`

from the other three variables. Display the`summary`

.

Solution

Thus:

`.1 <- glm(infected ~ age + weight + sex, family = "binomial", data = infect) infect`

`Error in eval(family$initialize): y values must be 0 <= y <= 1`

Oh, I forgot to turn `infected`

into a factor. This is the shortcut way to do that:

```
.1 <- glm(factor(infected) ~ age + weight + sex, family = "binomial", data = infect)
infectsummary(infect.1)
```

```
Call:
glm(formula = factor(infected) ~ age + weight + sex, family = "binomial",
data = infect)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.609369 0.803288 0.759 0.448096
age 0.012653 0.006772 1.868 0.061701 .
weight -0.227912 0.068599 -3.322 0.000893 ***
sexmale -1.543444 0.685681 -2.251 0.024388 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 83.234 on 80 degrees of freedom
Residual deviance: 59.859 on 77 degrees of freedom
AIC: 67.859
Number of Fisher Scoring iterations: 5
```

Or you could create a new or redefined column in the data frame containing the factor version of `infected`

, for example in this way:

```
%>%
infect mutate(infected = factor(infected)) %>%
glm(infected ~ age + weight + sex, family = "binomial", data = .) -> infect.1a
summary(infect.1a)
```

```
Call:
glm(formula = infected ~ age + weight + sex, family = "binomial",
data = .)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.609369 0.803288 0.759 0.448096
age 0.012653 0.006772 1.868 0.061701 .
weight -0.227912 0.068599 -3.322 0.000893 ***
sexmale -1.543444 0.685681 -2.251 0.024388 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 83.234 on 80 degrees of freedom
Residual deviance: 59.859 on 77 degrees of freedom
AIC: 67.859
Number of Fisher Scoring iterations: 5
```

Either way is good, and gives the same answer. The second way uses the `data=.`

trick to ensure that the input data frame to `glm`

is the output from the previous step, the one with the factor version of `infected`

in it. The `data=.`

is needed because `glm`

requires a model formula first rather than a data frame (if the data were first, you could just omit it).

\(\blacksquare\)

- * Which variables, if any, would you consider removing from the model? Explain briefly.

Solution

This is the same idea as in multiple regression: look at the end of the line for each variable to get its individual P-value, and if that’s not small, you can take that variable out. `age`

has a P-value of 0.062, which is (just) larger than 0.05, so we can consider removing this variable. The other two P-values, 0.00089 and 0.024, are definitely less than 0.05, so those variables should stay.

Alternatively, you can say that the P-value for `age`

is small enough to be interesting, and therefore that `age`

should stay. That’s fine, but then you need to be consistent in the next part.

You probably noted that `sex`

is categorical. However, it has only the obvious two levels, and such a categorical variable can be assessed for significance this way. If you were worried about this, the right way to go is `drop1`

:

`drop1(infect.1, test = "Chisq")`

The P-values are similar, but not identical.^{7}

I have to stop and think about this. There is a lot of theory that says there are several ways to do stuff in regression, but they are all identical. The theory doesn’t quite apply the same in generalized linear models (of which logistic regression is one): if you had an infinite sample size, the ways would all be identical, but in practice you’ll have a very finite amount of data, so they won’t agree.

I’m thinking about my aims here: I want to decide whether each \(x\)-variable should stay in the model, and for that I want a test that expresses whether the model fits significantly worse if I take it out. The result I get ought to be the same as physically removing it and comparing the models with `anova`

, eg. for `age`

:

```
.1b <- update(infect.1, . ~ . - age)
infectanova(infect.1b, infect.1, test = "Chisq")
```

This is the same thing as `drop1`

gives.

So, I think: use `drop1`

to assess whether anything should come out of a model like this, and use `summary`

to obtain the slopes to interpret (in this kind of model, whether they’re positive or negative, and thus what kind of effect each explanatory variable has on the probability of whatever-it-is.)

\(\blacksquare\)

Solution

I think they are extremely consistent. When we looked at the plots, we said that `weight`

and `sex`

had large effects, and they came out definitely significant. There was a small difference in age between the infected and non-infected groups, and `age`

came out borderline significant (with a P-value definitely larger than for the other variables, so that the evidence of its usefulness was weaker).

\(\blacksquare\)

- * The first and third quartiles of
`age`

are 26 and 130; the first and third quartiles of`weight`

are 9 and 16. Obtain predicted probabilities for all combinations of these and`sex`

. (You’ll need to start by making a new data frame, using`datagrid`

to get all the combinations.)

Solution

Here’s how `datagrid`

goes. Note my use of plural names to denote the things I want all combinations of:

```
<- c(26, 130)
ages <- c(9, 16)
weights <- c("female", "male")
sexes <- datagrid(model = infect.1, age = ages, weight = weights, sex = sexes)
new new
```

This is about on the upper end of what you would want to do just using the one line with `datagrid`

in it and putting the actual values in instead of `ages`

, `weights`

etc. To my mind, once it gets longer than about this long, doing on one line starts to get unwieldy. But your taste might be different than mine.

Aside:

I could have asked you to include some more values of `age`

and `weight`

, for example the median as well, to get a clearer picture. But that would have made `infect.new`

bigger, so I stopped here. (If we had been happy with *five* representative values of `age`

and `weight`

, we could have done `predictions`

(below) with `variables`

and not had to make `new`

at all.)

`datagrid`

*makes* a data frame from input vectors, so it doesn’t matter if those are different lengths. In fact, it’s also possible to make this data frame from things like quartiles stored in a data frame. To do that, you wrap the whole `datagrid`

in a `with`

:

```
<- tibble(age = ages, weight = weights, sex = sexes)
d d
```

```
<- with(d, datagrid(model = infect.1, age=age, weight=weight, sex=sex))
new new
```

This one is a little confusing because in eg. `age = age`

, the first `age`

refers to the column that will be in `new`

, and the second one refers to the values that are going in there, namely the column called `age`

*in the dataframe d*.^{8}

End of aside.

Next, the predictions:

`cbind(predictions(infect.1, new))`

Extra: I didn’t ask you to comment on these, since the question is long enough already. But that’s not going to stop me!

These, in `predicted`

, are predicted probabilities of infection.^{9}

The way I remember the one-column-response thing is that the first level is the baseline (as it is in a regression with a categorical explanatory variable), and the second level is the one whose probability is modelled (in the same way that the second, third etc. levels of a categorical explanatory variable are the ones that appear in the `summary`

table).

Let’s start with `sex`

. The probabilities of a female being infected are all much higher than of a corresponding male (with the same age and weight) being infected. Compare, for example, lines 1 and 2. Or 3 and 4. Etc. So `sex`

has a big effect.

What about `weight`

? As weight goes from 9 to 16, with everything else the same, the predicted probability of infection goes sharply *down*. This is what we saw before: precisely, the boxplot showed us that infected animals were likely to be less heavy.

Last, `age`

. As age goes up, the probabilities go (somewhat) up as well. Compare, for example, lines 1 and 5 or lines 4 and 8. I think this is a less dramatic change than for the other variables, but that’s a judgement call.

I got this example from (horrible URL warning) here: link It starts on page 275 in my edition. He goes at the analysis a different way, but he finishes with another issue that I want to show you.

Let’s work out the residuals and plot them against our quantitative explanatory variables. I think the best way to do this is `augment`

from `broom`

, to create a data frame containing the residuals alongside the original data:

```
library(broom)
.1a <- infect.1 %>% augment(infect)
infect.1a %>% as_tibble() infect
```

```
ggplot(infect.1a, aes(x = weight, y = .resid)) + geom_point() +
geom_smooth()
```

``geom_smooth()` using method = 'loess' and formula = 'y ~ x'`

`infect.1a`

is, I think, a genuine `data.frame`

rather than a `tibble`

.

I don’t quite know what to make of that plot. It doesn’t look quite random, and yet there are just some groups of points rather than any real kind of trend.

The corresponding plot with age goes the same way:

```
ggplot(infect.1a, aes(x = age, y = .resid)) + geom_point() +
geom_smooth()
```

``geom_smooth()` using method = 'loess' and formula = 'y ~ x'`

Crawley found the slightest suggestion of an up-and-down curve in there. I’m not sure I agree, but that’s what he saw. As with a regular regression, the residuals against anything should look random, with no trends. (Though the residuals from a logistic regression can be kind of odd, because the response variable can only be 1 or 0.) Crawley tries adding squared terms to the logistic regression, which goes like this. The `glm`

statement is long, as they usually are, so it’s much easier to use `update`

:

`.2 <- update(infect.1, . ~ . + I(age^2) + I(weight^2)) infect`

As we saw before, when thinking about what to keep, we want to look at `drop1`

:

`drop1(infect.2, test = "Chisq")`

The squared terms are both significant. The linear terms, `age`

and `weight`

, have to stay, regardless of their significance.^{10} What do the squared terms do to the predictions? Before, there was a clear one-directional trend in the relationships with `age`

and `weight`

. Has that changed? Let’s see. We’ll need a few more ages and weights to investigate with. I think the set of five representative values that comes out of `predictions`

with `variables`

would be ideal (and saves us having to make another `new`

).

Let’s start by assessing the effect of age:

`summary(infect)`

```
infected age weight sex
Length:81 Min. : 1.00 Min. : 1.00 Length:81
Class :character 1st Qu.: 26.00 1st Qu.: 9.00 Class :character
Mode :character Median : 87.00 Median :13.00 Mode :character
Mean : 83.65 Mean :11.49
3rd Qu.:130.00 3rd Qu.:16.00
Max. :206.00 Max. :18.00
```

```
<- datagrid(model = infect.2, age = c(1, 26, 84, 130, 206))
new new
```

```
cbind(predictions(infect.2, newdata = new)) %>%
select(estimate, age, weight, sex)
```

The actual values of age we chose are as shown. The other columns are constant; the values are the mean weight and the more common sex. We really can see the effect of age with all else constant.

Here, the predicted infection probabilities go up with age and then come down again, as a squared term in age will allow them to do, compared to what we had before:

```
cbind(predictions(infect.1, newdata = new)) %>%
select(estimate, age, weight, sex)
```

where the probabilities keep on going up.

All right, what about the effect of weight? Here’s the first model:

```
<- datagrid(model = infect.2, weight = c(1, 9, 13, 16, 18))
new cbind(predictions(infect.1, newdata = new)) %>%
select(estimate, age, weight, sex)
```

and here’s the second one with squared terms:

```
cbind(predictions(infect.2, newdata = new)) %>%
select(estimate, age, weight, sex)
```

This one is not so dissimilar: in the linear model, the predicted probabilities of infection start high and go down, but in the model with squared terms they go slightly up before going down.

We couldn’t have a squared term in `sex`

, since there are only two sexes (in this data set), so the predictions might be pretty similar for the two models:

```
<- datagrid(model = infect.2, sex = c("female", "male"))
new cbind(predictions(infect.1, newdata = new)) %>%
select(estimate, age, weight, sex)
```

and

```
cbind(predictions(infect.2, newdata = new)) %>%
select(estimate, age, weight, sex)
```

They are actually quite different. For the model with squared terms in age and weight, the predicted probabilities of infection are a lot higher for both males and females, at least at these (mean) ages and weights.

\(\blacksquare\)

## 27.15 The brain of a cat

A large number (315) of psychology students were asked to imagine that they were serving on a university ethics committee hearing a complaint against animal research being done by a member of the faculty. The students were told that the surgery consisted of implanting a device called a cannula in each cat’s brain, through which chemicals were introduced into the brain and the cats were then given psychological tests. At the end of the study, the cats’ brains were subjected to histological analysis. The complaint asked that the researcher’s authorization to carry out the study should be withdrawn, and the cats should be handed over to the animal rights group that filed the complaint. It was suggested that the research could just as well be done with computer simulations.

All of the psychology students in the survey were told all of this. In addition, they read a statement by the researcher that no animal felt much pain at any time, and that computer simulation was *not* an adequate substitute for animal research. Each student was also given *one* of the following scenarios that explained the benefit of the research:

“cosmetic”: testing the toxicity of chemicals to be used in new lines of hair care products.

“theory”: evaluating two competing theories about the function of a particular nucleus in the brain.

“meat”: testing a synthetic growth hormone said to potentially increase meat production.

“veterinary”: attempting to find a cure for a brain disease that is killing domesticated cats and endangered species of wild cats.

“medical”: evaluating a potential cure for a debilitating disease that afflicts many young adult humans.

Finally, each student completed two questionnaires: one that would assess their “relativism”: whether or not they believe in universal moral principles (low score) or whether they believed that the appropriate action depends on the person and situation (high score). The second questionnaire assessed “idealism”: a high score reflects a belief that ethical behaviour will always lead to good consequences (and thus that if a behaviour leads to any bad consequences at all, it is unethical).^{11}

After being exposed to all of that, each student stated their decision about whether the research should continue or stop.

I should perhaps stress at this point that no actual cats were harmed in the collection of these data (which can be found as a `.csv`

file at link). The variables in the data set are these:

`decision`

: whether the research should continue or stop (response)`idealism`

: score on idealism questionnaire`relativism`

: score on relativism questionnaire`gender`

of student`scenario`

of research benefits that the student read.

A more detailed discussion^{12} of this study is at link.

- Read in the data and check by looking at the structure of your data frame that you have something sensible.
*Do not*call your data frame`decision`

, since that’s the name of one of the variables in it.

Solution

So, like this, using the name `decide`

in my case:

```
<- "https://raw.githubusercontent.com/nxskok/datafiles/master/decision.csv"
my_url <- read_csv(my_url) decide
```

```
Rows: 315 Columns: 5
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (3): decision, gender, scenario
dbl (2): idealism, relativism
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
```

` decide`

The variables are all the right things and of the right types: the decision, gender and the scenario are all text (representing categorical variables), and idealism and relativism, which were scores on a test, are quantitative (numerical). There are, as promised, 315 observations.

\(\blacksquare\)

- Fit a logistic regression predicting
`decision`

from`gender`

. Is there an effect of gender?

Solution

Turn the response into a `factor`

somehow, either by creating a new variable in the data frame or like this:

```
.1 <- glm(factor(decision) ~ gender, data = decide, family = "binomial")
decidesummary(decide.1)
```

```
Call:
glm(formula = factor(decision) ~ gender, family = "binomial",
data = decide)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.8473 0.1543 5.491 3.99e-08 ***
genderMale -1.2167 0.2445 -4.976 6.50e-07 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 425.57 on 314 degrees of freedom
Residual deviance: 399.91 on 313 degrees of freedom
AIC: 403.91
Number of Fisher Scoring iterations: 4
```

The P-value for gender is \(6.5 \times 10^{-7}\), which is very small, so there is definitely an effect of gender. It’s not immediately clear what kind of effect it is: that’s the reason for the next part, and we’ll revisit this slope coefficient in a moment. Categorical *explanatory* variables are perfectly all right as text. Should I have used `drop1`

to assess the significance? Maybe:

`drop1(decide.1, test = "Chisq")`

The thing is, this gives us a P-value but not a slope, which we might have wanted to try to interpret. Also, the P-value in `summary`

is so small that it is likely to be still significant in `drop1`

as well.

\(\blacksquare\)

- To investigate the effect (or non-effect) of
`gender`

, create a contingency table by feeding`decision`

and`gender`

into`table`

. What does this tell you?

Solution

`with(decide, table(decision, gender))`

```
gender
decision Female Male
continue 60 68
stop 140 47
```

Females are more likely to say that the study should stop (a clear majority), while males are more evenly split, with a small majority in favour of the study continuing.

If you want the column percents as well, you can use `prop.table`

. Two steps: save the table from above into a variable, then feed *that* into `prop.table`

, calling for column percentages rather than row percentages:

```
<- with(decide, table(decision, gender))
tab prop.table(tab, 2)
```

```
gender
decision Female Male
continue 0.3000000 0.5913043
stop 0.7000000 0.4086957
```

Why column percentages? Well, thinking back to STAB22 or some such place, when one of your variables is acting like a response or outcome (`decision`

here), make the percentages out of the *other* one. Given that a student is a female, how likely are they to call for the research to stop? The other way around makes less sense: given that a person wanted the research to stop, how likely are they to be female?

About 70% of females and 40% of males want the research to stop. That’s a giant-sized difference. No wonder it was significant.

The other way of making the table is to use `xtabs`

, with the same result:

`xtabs(~ decision + gender, data = decide)`

```
gender
decision Female Male
continue 60 68
stop 140 47
```

In this one, the frequency variable goes on the left side of the squiggle. We don’t have one here (each row of the data frame represents one student), so we leave the left side blank. I tried putting a `.`

there, but that doesn’t work since there is no “whatever was there before” as there is, for example, in `update`

.

As you’ll see later on with log-linear models, I now prefer making graphs with these. This is a grouped bar chart that we learned early in STAC32:

`ggplot(decide, aes(x = gender, fill = decision)) + geom_bar(position = "fill")`

I used `position = "fill"`

because I wanted to compare *proportions*. This scales each bar to have height 1, no matter how many people it’s based on. As a result, you see that if you look at females, most of them think the research as described should stop, and a small majority of the males think it should continue. This is why we have a gender effect.

\(\blacksquare\)

- * Is your slope for
`gender`

in the previous logistic regression positive or negative? Is it applying to males or to females? Looking at the conclusions from your contingency table, what probability does that mean your logistic regression is actually modelling?

Solution

My slope is \(-1.2167\), negative, and it is attached to males (note that the slope is called `gendermale`

: because “female” is before “male” alphabetically, females are used as the baseline and this slope says how males compare to them). This negative male coefficient means that the probability of whatever is being modelled is *less* for males than it is for females. Looking at the contingency table for the last part, the probability of “stop” should be less for males, so the logistic regression is actually modelling the probability of “stop”. Another way to reason that this must be the right answer is that the two values of `decision`

are `continue`

and `stop`

; `continue`

is first alphabetically, so it’s the baseline, and the *other* one, `stop`

, is the one whose probability is being modelled. That’s why I made you do that contingency table. Another way to think about this is to do a prediction, which would go like this:

```
<- datagrid(model = decide.1, gender = levels(factor(decide$gender)))
new new
```

`cbind(predictions(decide.1, newdata = new))`

Technical note: we could simply supply the two genders in the definition of `new`

, remembering to type the Capital Letters. The other way is to find out which genders there are in the data, and one way is to temporarily turn `gender`

into a `factor`

and then find out which different values it has, which are called “levels”.

The probability of whatever-it-is is exactly 70% for females and about 40% for males. A quick look at the contingency table shows that exactly 70% (\(140/200\)) of the females think the research should stop, and a bit less than 50% of the males think the same thing. So the model is predicting the probability of “stop”.

There’s a logic to this: it’s not just this way “because it is”. It’s the same idea of the first category, now of the response factor, being a “baseline”, and what actually gets modelled is the *second* category, relative to the baseline.

\(\blacksquare\)

- Add the two variables
`idealism`

and`relativism`

to your logistic regression. Do either or both of them add significantly to your model? Explain briefly.

Solution

The obvious way of doing this is to type out the entire model, with the two new variables on the end. You have to remember to turn `decision`

into a `factor`

again:

```
.2 <- glm(factor(decision) ~ gender + idealism + relativism,
decidedata = decide, family = "binomial"
)summary(decide.2)
```

```
Call:
glm(formula = factor(decision) ~ gender + idealism + relativism,
family = "binomial", data = decide)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.4876 0.9787 -1.520 0.12849
genderMale -1.1710 0.2679 -4.372 1.23e-05 ***
idealism 0.6893 0.1115 6.180 6.41e-10 ***
relativism -0.3432 0.1245 -2.757 0.00584 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 425.57 on 314 degrees of freedom
Residual deviance: 346.50 on 311 degrees of freedom
AIC: 354.5
Number of Fisher Scoring iterations: 4
```

This is not so bad, copying and pasting. But the way I like better, when you’re making a smallish change to a longish model, is to use `update`

:

```
.2 <- update(decide.1, . ~ . + idealism + relativism)
decidesummary(decide.2)
```

```
Call:
glm(formula = factor(decision) ~ gender + idealism + relativism,
family = "binomial", data = decide)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.4876 0.9787 -1.520 0.12849
genderMale -1.1710 0.2679 -4.372 1.23e-05 ***
idealism 0.6893 0.1115 6.180 6.41e-10 ***
relativism -0.3432 0.1245 -2.757 0.00584 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 425.57 on 314 degrees of freedom
Residual deviance: 346.50 on 311 degrees of freedom
AIC: 354.5
Number of Fisher Scoring iterations: 4
```

Either way is good. The conclusion you need to draw is that they both have something to add, because their P-values are both less than 0.05.

Or (and perhaps better) you can look at `drop1`

of either of these:

`drop1(decide.2, test = "Chisq")`

\(\blacksquare\)

- Add the variable
`scenario`

to your model. That is, fit a new model with that variable plus all the others.

Solution

To my mind, `update`

wins hands down here:

`.3 <- update(decide.2, . ~ . + scenario) decide`

You can display the summary here if you like, but we’re not going to look at it yet.

\(\blacksquare\)

- Use
`anova`

to compare the models with and without`scenario`

. You’ll have to add a`test="Chisq"`

to your`anova`

, to make sure that the test gets done. Does`scenario`

make a difference or not, at \(\alpha=0.10\)? Explain briefly. (The reason we have to do it this way is that`scenario`

is a factor with five levels, so it has four slope coefficients. To test them all at once, which is what we need to make an overall test for`scenario`

, this is the way it has to be done.)

Solution

These are the models that you fit in the last two parts:

`anova(decide.2, decide.3, test = "Chisq")`

The P-value is not less than 0.05, but it *is* less than 0.10, which is what I implied to assess it with, so the scenario does make some kind of difference.

Extra: another way to do this, which I like better (but the `anova`

way was what I asked in the original question), is to look at `decide.3`

and ask “what can I get rid of”, in such a way that categorical variables stay or go as a whole. This is done using `drop1`

. It’s a little different from the corresponding thing in regression because the right way to do the test is not an F test, but now a chi-squared test (this is true for all generalized linear models of which logistic regression is one):

`drop1(decide.3, test = "Chisq")`

The test for `scenario`

has four degrees of freedom (since there are five scenarios), and is in fact exactly the same test as in `anova`

, significant at \(\alpha=0.10\).

\(\blacksquare\)

- Look at the
`summary`

of your model that contained`scenario`

. Bearing in mind that the slope coefficient for`scenariocosmetic`

is zero (since this is the first scenario alphabetically), which scenarios have the most positive and most negative slope coefficients? What does that tell you about those scenarios’ effects?

Solution

All right. This is the model I called `decide.3`

:

`summary(decide.3)`

```
Call:
glm(formula = factor(decision) ~ gender + idealism + relativism +
scenario, family = "binomial", data = decide)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.5694 1.0426 -1.505 0.1322
genderMale -1.2551 0.2766 -4.537 5.70e-06 ***
idealism 0.7012 0.1139 6.156 7.48e-10 ***
relativism -0.3264 0.1267 -2.576 0.0100 *
scenariomeat 0.1565 0.4283 0.365 0.7149
scenariomedical -0.7095 0.4202 -1.688 0.0914 .
scenariotheory 0.4501 0.4271 1.054 0.2919
scenarioveterinary -0.1672 0.4159 -0.402 0.6878
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 425.57 on 314 degrees of freedom
Residual deviance: 338.06 on 307 degrees of freedom
AIC: 354.06
Number of Fisher Scoring iterations: 4
```

The most positive coefficient is for `theory`

and the most negative one is for `medical`

. (The zero coefficient is in the middle.) Since we are modelling the probability of saying that the research should *stop* (part (here)), this means that:

the “theory” scenario (evaluating theories about brain function) is most likely to lead to someone saying that the research should stop (other things being equal)

the “medical” scenario (finding a cure for a human disease) is most likely to lead to someone saying that the research should continue (or least likely to say that it should stop), again, other things being equal.

These make some kind of sense because being exposed to a scenario where there are tangible benefits later ought to be most favourable to the research continuing, and people are not going to be impressed by something that is “only theoretical” without any clear benefits.

We can also tackle this by doing some predictions. We want all the categories for `scenario`

, and we might as well use average values for everything else:

```
<- datagrid(model = decide.3, scenario = levels(factor(decide$scenario)))
new new
```

`cbind(predictions(decide.3, newdata = new)) %>% select(estimate, gender, idealism, relativism, scenario)`

The scenarios are over on the right, and the values of the other variables are the same all the way down (means for the quantitative ones, the most common category for the categorical ones). Having checked that this is indeed the case, we really only need the predictions and the scenarios.

This echoes what we found before: the probability of saying that the research should stop is highest for “theory” and the lowest for “medical”.

I assumed in my model that the effect of the scenarios was the same for males and females. If I wanted to test that, I’d have to add an interaction and test that. This works most nicely using `update`

and then `anova`

, to fit the model with interaction and compare it with the model without:

```
.4 <- update(decide.3, . ~ . + gender * scenario)
decideanova(decide.3, decide.4, test = "Chisq")
```

No evidence at all that the scenarios have different effects for the different genders. The appropriate `predictions`

should show that too:

```
<- datagrid(model = decide.4,
new gender = levels(factor(decide$gender)),
scenario = levels(factor(decide$scenario)))
cbind(predictions(decide.4, newdata = new)) %>% select(estimate, gender, idealism, relativism, scenario)
```

The predicted probabilities that the experiment should be stopped are a lot lower for males than females across the board (this is the strongly significant gender effect). But, given that, the probability for `medical`

is the lowest for both males and females, and the probability for `theory`

is the highest for females and almost the highest for males. (The pattern for males and females is similar enough to suggest that there should be no interaction.)

So fitting an interaction was a waste of time, but it was worth checking whether it was.

\(\blacksquare\)

- Describe the effects that having (i) a higher idealism score and (ii) a higher relativity score have on a person’s probability of saying that the research should stop. Do each of these increase or decrease that probability? Explain briefly.

Solution

Look back at the summary for the model that I called `decide.3`

. (Or `decide.2`

: the slope coefficients are very similar.) The one for `idealism`

is positive, so that a higher idealism score goes with a greater likelihood of saying that the research should stop. The slope coefficient for `relativity`

is negative, so it’s the other way around: a higher relativity score goes with a *lower* chance of saying that the research should stop.

That’s all I needed, but as an extra, we can look back at the description of these scales in the question.

The `relativism`

one was that a person believed that the most moral action depends on the situation (as opposed to a person having something like religious faith that asserts universal moral principles that are always true. That would be a low score on the relativism scale). Somebody with a low score on this scale might believe something like “it is always wrong to experiment on animals”, whereas somebody with a high relativism score might say that it was sometimes justified. Thus, other things being equal, a low relativism score would go with “stop” and a high relativism score would (or might) go with “continue”. This would mean a *negative* slope coefficient for `relativism`

, which is what we observed. (Are you still with me? There was some careful thinking there.)

What about `idealism`

? This is a belief that ethical behaviour will always lead to good consequences, and thus, if the consequences are bad, the behaviour must not have been ethical. A person who scores high on idealism is likely to look at the consequences (experimentation on animals), see that as a bad thing, and thus conclude that the research should be stopped. The `idealism`

slope coefficient, by that argument, should be positive, and is.

This can also be done by a prediction. In the light of what we have done before, the suggestion is this. Idealism and relativism are quantitative, so let’s grab their quartiles, giving us \(4 = 2 \times 2\) combinations:

```
<- datagrid(model = decide.3,
new idealism = quantile(decide$idealism, c(0.25, 0.75)),
relativism = quantile(decide$relativism, c(0.25, 0.75)))
new
```

`cbind(predictions(decide.3, newdata = new)) %>% select(estimate, gender, idealism, relativism, scenario)`

For both of the idealism scores, the higher relativism score went with a *lower* probability of “stop” (the “negative” effect), and for both of the relativism scores, the higher idealism score went with a *higher* probability of “stop” (the positive effect).

Yet another way to assess this would be to make a graph:

`plot_cap(decide.3, condition = c("relativism", "idealism"))`

The story from here is the same: as relativism increases (move from left to right), the probability of `stop`

decreases, but as idealism increases (from the red line up to the purple one), the probability of `stop`

*increases*.

That’s quite enough discussion of the question, except that the data didn’t come to me in the form that you see them, so I figured I would like to share the story of the data processing as well. I think this is important because in your future work you are likely to spend a lot of your time getting data from how you receive it to something suitable for analysis.

These data came from a psychology study (with, probably, the students in a class serving as experimental subjects). Social scientists like to use SPSS software, so the data came to me as an SPSS `.sav`

file.^{13} The least-fuss way of handling this that I could think of was to use `import`

from the `rio`

package, which I think I mentioned before:

```
library(rio)
<- import("/home/ken/Downloads/Logistic.sav")
x str(x)
```

```
'data.frame': 315 obs. of 11 variables:
$ decision : num 0 1 1 0 1 1 0 0 0 0 ...
..- attr(*, "format.spss")= chr "F1.0"
..- attr(*, "labels")= Named num [1:2] 0 1
.. ..- attr(*, "names")= chr [1:2] "stop" "continue"
$ idealism : num 8.2 6.8 8.2 7.4 1.7 5.6 7.2 7.8 7.8 8 ...
..- attr(*, "format.spss")= chr "F12.4"
$ relatvsm : num 5.1 5.3 6 6.2 3.1 7.7 6.7 4 4.7 7.6 ...
..- attr(*, "format.spss")= chr "F12.4"
$ gender : num 0 1 0 0 0 1 0 1 0 0 ...
..- attr(*, "format.spss")= chr "F1.0"
..- attr(*, "labels")= Named num [1:2] 0 1
.. ..- attr(*, "names")= chr [1:2] "Female" "Male"
$ cosmetic : num 1 1 1 1 1 1 1 1 1 1 ...
..- attr(*, "format.spss")= chr "F1.0"
$ theory : num 0 0 0 0 0 0 0 0 0 0 ...
..- attr(*, "format.spss")= chr "F1.0"
$ meat : num 0 0 0 0 0 0 0 0 0 0 ...
..- attr(*, "format.spss")= chr "F1.0"
$ veterin : num 0 0 0 0 0 0 0 0 0 0 ...
..- attr(*, "format.spss")= chr "F1.0"
$ idealism_LN: num 2.104 1.917 2.104 2.001 0.531 ...
..- attr(*, "format.spss")= chr "F8.2"
..- attr(*, "display_width")= int 13
$ relatvsm_LN: num 1.63 1.67 1.79 1.82 1.13 ...
..- attr(*, "format.spss")= chr "F8.2"
..- attr(*, "display_width")= int 13
$ scenario : num 1 1 1 1 1 1 1 1 1 1 ...
..- attr(*, "format.spss")= chr "F8.2"
..- attr(*, "display_width")= int 10
```

The last line `str`

displays the “structure” of the data frame that was obtained. Normally a data frame read into R has a much simpler structure than this, but this is R trying to interpret how SPSS does things. Here, each column (listed on the lines beginning with a dollar sign) has some values, listed after `num`

; they are all numeric, even the categorical ones. What happened to the categorical variables is that they got turned into numbers, and they have a `names`

“attribute” further down that says what those numbers actually represent. Thus, on the `gender`

line, the subjects are a female (0), then a male (1), then three females, then a male, and so on. Variables like `gender`

are thus so far neither really factors nor text variables, and so we’ll have to do a bit of processing before we can use them: we want to replace the numerical values by the appropriate “level”.

To turn a numeric variable into text depending on the value, we can use `ifelse`

, but this gets unwieldy if there are more than two values to translate. For that kind of job, I think `case_when`

is a lot easier to read. It also lets us have a catch-all for catching errors — “impossible” values occur distressingly often in real data:

```
<- x %>%
xx mutate(
decision = case_when(
== 0 ~ "stop",
decision == 1 ~ "continue",
decision .default = "error"
),gender = case_when(
== 0 ~ "Female",
gender == 1 ~ "Male",
gender .default = "error"
),scenario = case_when(
== 1 ~ "cosmetic",
scenario == 2 ~ "theory",
scenario == 3 ~ "meat",
scenario == 4 ~ "veterinary",
scenario == 5 ~ "medical",
scenario .default = "error"
)
)%>% as_tibble() %>% select(-(cosmetic:veterin)) xx
```

`xx`

is a “real” `data.frame`

(that’s what `rio`

reads in), and has some extra columns that we don’t want to see right now.

I have three new variables being created in one `mutate`

. Each is being created using a `case_when`

. The thing on the left of each squiggle is a logical condition being tested; the first of these logical conditions to come out `TRUE`

provides the value for the new variable on the right of the squiggle. Thus, if the (old) `scenario`

is 2, the new `scenario`

will be `theory`

. The `TRUE`

lines in each case provide something that is guaranteed to be true, even if all the other lines are false (eg. if `scenario`

is actually recorded as 7, which would be an error).

I overwrote the old variable values with the new ones, which is a bit risky, but then I’d have more things to get rid of later.

My next step is to check that I don’t actually have any errors:

`%>% count(scenario, gender, decision) xx `

Don’t see any errors there.

So now let’s write what we have to a file. I think a `.csv`

would be smart:

```
%>%
xx select(decision, idealism, relatvsm, gender, scenario) %>%
write_csv("decision.csv")
```

There is one more tiny detail: in SPSS, variable names can have a maximum of eight letters. “Relativism” has 10. So the original data file had the name “relativism” minus the two “i”s. I changed that so you would be dealing with a proper English word. (That change is not shown here.)

There is actually a *town* called Catbrain. It’s in England, near Bristol, and seems to be home to a street of car dealerships. One of the questions in the chapter on making maps asks you to find out where it is exactly.

\(\blacksquare\)

## 27.16 How not to get heart disease

What is associated with heart disease? In a study, a large number of variables were measured, as follows:

`age`

(years)`sex`

male or female`pain.type`

Chest pain type (4 values: typical angina, atypical angina, non-anginal pain, asymptomatic)`resting.bp`

Resting blood pressure, on admission to hospital`serum.chol`

Serum cholesterol`high.blood.sugar`

: greater than 120, yes or no`electro`

resting electrocardiographic results (normal, having ST-T, hypertrophy)`max.hr`

Maximum heart rate`angina`

Exercise induced angina (yes or no)`oldpeak`

ST depression induced by exercise relative to rest. See link.`slope`

Slope of peak exercise ST segment. Sloping up, flat or sloping down`colored`

number of major vessels (0–3) coloured by fluoroscopy`thal`

normal, fixed defect, reversible defect`heart.disease`

yes, no (response)

I don’t know what most of those are, but we will not let that stand in our way. Our aim is to find out what variables are associated with heart disease, and what values of those variables give high probabilities of heart disease being present. The data are in link.

- Read in the data. Display the first few lines and convince yourself that those values are reasonable.

Solution

A `.csv`

file, so:

```
<- "https://raw.githubusercontent.com/nxskok/datafiles/master/heartf.csv"
my_url <- read_csv(my_url) heart
```

```
New names:
Rows: 270 Columns: 15
── Column specification
──────────────────────────────────────────────────────── Delimiter: "," chr
(8): sex, pain.type, high.blood.sugar, electro, angina, slope, thal, hea... dbl
(7): ...1, age, resting.bp, serum.chol, max.hr, oldpeak, colored
ℹ Use `spec()` to retrieve the full column specification for this data. ℹ
Specify the column types or set `show_col_types = FALSE` to quiet this message.
• `` -> `...1`
```

` heart`

You should check that the variables that should be numbers actually are, that the variables that should be categorical have (as far as is shown) the right values as per my description above, and you should make some comment in that direction.

My variables appear to be correct, apart possibly for that variable `X1`

which is actually just the row number.

\(\blacksquare\)

- In a logistic regression, what probability will be predicted here? Explain briefly but convincingly. (Is each line of the data file one observation or a summary of several?)

Solution

Each line of the data file is a single observation, not frequencies of yes and no (like the premature babies question, later, is). The response variable is a factor, so the first level is the baseline and the *second* level is the one predicted. R puts factor levels alphabetically, so `no`

is first and `yes`

is second. That is, a logistic regression will predict the probability that a person *does* have heart disease. I want to see that logic (which is why I said “convincingly”): one observation per line, and therefore that the second level of the factor is predicted, which is `yes`

.

\(\blacksquare\)

- * Fit a logistic regression predicting heart disease from everything else (if you have a column called
`X`

or`X1`

, ignore that), and display the results.

Solution

A lot of typing, since there are so many variables. Don’t forget that the response variable *must* be a factor:

```
.1 <- glm(factor(heart.disease) ~ age + sex + pain.type + resting.bp + serum.chol +
heart+ electro + max.hr + angina + oldpeak + slope + colored + thal,
high.blood.sugar family = "binomial", data = heart
)
```

You can split this over several lines (and probably should), but make sure to end each line in such a way that there is unambiguously more to come, for example with a plus or a comma (though probably the fact that you have an unclosed bracket will be enough).

The output is rather lengthy:

`summary(heart.1)`

```
Call:
glm(formula = factor(heart.disease) ~ age + sex + pain.type +
resting.bp + serum.chol + high.blood.sugar + electro + max.hr +
angina + oldpeak + slope + colored + thal, family = "binomial",
data = heart)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -3.973837 3.133311 -1.268 0.204707
age -0.016007 0.026394 -0.606 0.544208
sexmale 1.763012 0.580761 3.036 0.002400 **
pain.typeatypical -0.997298 0.626233 -1.593 0.111264
pain.typenonanginal -1.833394 0.520808 -3.520 0.000431 ***
pain.typetypical -2.386128 0.756538 -3.154 0.001610 **
resting.bp 0.026004 0.012080 2.153 0.031346 *
serum.chol 0.006621 0.004228 1.566 0.117322
high.blood.sugaryes -0.370040 0.626396 -0.591 0.554692
electronormal -0.633593 0.412073 -1.538 0.124153
electroSTT 0.013986 3.184512 0.004 0.996496
max.hr -0.019337 0.011486 -1.683 0.092278 .
anginayes 0.596869 0.460540 1.296 0.194968
oldpeak 0.449245 0.244631 1.836 0.066295 .
slopeflat 0.827054 0.966139 0.856 0.391975
slopeupsloping -0.122787 1.041666 -0.118 0.906166
colored 1.199839 0.280947 4.271 1.95e-05 ***
thalnormal 0.146197 0.845517 0.173 0.862723
thalreversible 1.577988 0.838550 1.882 0.059863 .
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 370.96 on 269 degrees of freedom
Residual deviance: 168.90 on 251 degrees of freedom
AIC: 206.9
Number of Fisher Scoring iterations: 6
```

I didn’t ask you for further comment, but note that quite a lot of these variables are factors, so you get slopes for things like `pain.typeatypical`

. When you have a factor in a model, there is a slope for each level except for the first, which is a baseline (and its slope is taken to be zero). That would be `asymptomatic`

for `pain.type`

. The \(t\)-tests for the other levels of `pain.type`

say whether that level of pain type differs significantly (in terms of probability of heart disease) from the baseline level. Here, pain type `atypical`

is not significantly different from the baseline, but the other two pain types, `nonanginal`

and `typical`

, *are* significantly different. If you think about this from an ANOVA-like point of view, the question about `pain.type`

’s significance is really “is there at least one of the pain types that is different from the others”, and if we’re thinking about whether we should keep `pain.type`

in the logistic regression, this is the kind of question we should be thinking about.

\(\blacksquare\)

- Quite a lot of our explanatory variables are factors. To assess whether the factor as a whole should stay or can be removed, looking at the slopes won’t help us very much (since they tell us whether the other levels of the factor differ from the baseline, which may not be a sensible comparison to make). To assess which variables are candidates to be removed, factors included (properly), we can use
`drop1`

. Feed`drop1`

a fitted model and the words`test="Chisq"`

(take care of the capitalization!) and you’ll get a list of P-values. Which variable is the one that you would remove first? Explain briefly.

Solution

Following the instructions:

`drop1(heart.1, test = "Chisq")`

The highest P-value, 0.5525, goes with `high.blood.sugar`

, so this one comes out first. (The P-value for `age`

is almost as high, 0.5427, so you might guess that this will be next.)

You might be curious about how these compare with the P-values on `summary`

. These two P-values are almost the same as the ones on `summary`

, because they are a two-level factor and a numeric variable respectively, and so the tests are equivalent in the two cases. (The P-values are not identical because the tests on `summary`

and `drop1`

are the kind of thing that would be identical on a regular regression but are only “asymptotically the same” in logistic regression, so you’d expect them to be close without being the same, as here. “Asymptotically the same” means that if you had an infinitely large sample size, they’d be identical, but our sample size of 200-odd individuals is not infinitely large! Anyway, the largest P-value on the `summary`

is 0.9965, which goes with `electroSTT`

. `electro`

, though, is a factor with three levels; this P-value says that `STT`

is almost identical (in its effects on heart disease) with the baseline `hypertrophy`

. But there is a third level, `normal`

, which is a bit different from `hypertrophy`

. So the factor `electro`

overall has some effect on heart disease, which is reflected in the `drop1`

P-value of 0.12: this might go later, but it has to stay for now because at least one of its levels is different from the others in its effect on heart disease. (In backward elimination, multi-level factors are removed in their entirety if *none* of their levels have a different effect from any of the others.)

The power just went out here, so I am using my laptop on battery on its own screen, rather than on the big screen I have in my office, which is much better.

\(\blacksquare\)

- I’m not going to make you do the whole backward elimination (I’m going to have you use
`step`

for that later), but do one step: that is, fit a model removing the variable you think should be removed, using`update`

, and then run`drop1`

again to see which variable will be removed next.

Solution

`update`

is the obvious choice here, since we’re making a small change to a *very* big model:

```
.2 <- update(heart.1, . ~ . - high.blood.sugar)
heartdrop1(heart.2, test = "Chisq")
```

The power is back.

The next variable to go is indeed `age`

, with a P-value that has hardly changed: it is now 0.5218.

\(\blacksquare\)

- Use
`step`

to do a backward elimination to find which variables have an effect on heart disease. Display your final model (which you can do by saving the output from`step`

in a variable, and asking for the summary of that. In`step`

, you’ll need to specify a starting model (the one from part (here)), the direction of elimination, and the test to display the P-value for (the same one as you used in`drop1`

). (Note: the actual decision to keep or drop explanatory variables is based on AIC rather than the P-value, with the result that`step`

will sometimes keep variables you would have dropped, with P-values around 0.10.)

Solution

The hints ought to lead you to this:

`.3 <- step(heart.1, direction = "backward", test = "Chisq") heart`

```
Start: AIC=206.9
factor(heart.disease) ~ age + sex + pain.type + resting.bp +
serum.chol + high.blood.sugar + electro + max.hr + angina +
oldpeak + slope + colored + thal
Df Deviance AIC LRT Pr(>Chi)
- high.blood.sugar 1 169.25 205.25 0.3528 0.5525052
- age 1 169.27 205.27 0.3705 0.5427474
- electro 2 171.31 205.31 2.4119 0.2994126
- angina 1 170.55 206.55 1.6562 0.1981121
<none> 168.90 206.90
- slope 2 172.98 206.98 4.0844 0.1297422
- serum.chol 1 171.34 207.34 2.4484 0.1176468
- max.hr 1 171.84 207.84 2.9391 0.0864608 .
- oldpeak 1 172.44 208.44 3.5449 0.0597303 .
- resting.bp 1 173.78 209.78 4.8793 0.0271810 *
- thal 2 180.78 214.78 11.8809 0.0026308 **
- sex 1 179.16 215.16 10.2684 0.0013533 **
- pain.type 3 187.85 219.85 18.9557 0.0002792 ***
- colored 1 191.78 227.78 22.8878 1.717e-06 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Step: AIC=205.25
factor(heart.disease) ~ age + sex + pain.type + resting.bp +
serum.chol + electro + max.hr + angina + oldpeak + slope +
colored + thal
Df Deviance AIC LRT Pr(>Chi)
- electro 2 171.65 203.65 2.3963 0.301750
- age 1 169.66 203.66 0.4104 0.521756
- angina 1 170.78 204.78 1.5323 0.215764
- slope 2 173.18 205.18 3.9288 0.140240
<none> 169.25 205.25
- serum.chol 1 171.69 205.69 2.4458 0.117841
- max.hr 1 172.35 206.35 3.0969 0.078440 .
- oldpeak 1 173.26 207.26 4.0094 0.045248 *
- resting.bp 1 173.84 207.84 4.5942 0.032080 *
- sex 1 179.27 213.27 10.0229 0.001546 **
- thal 2 181.61 213.61 12.3588 0.002072 **
- pain.type 3 190.54 220.54 21.2943 9.145e-05 ***
- colored 1 191.87 225.87 22.6232 1.971e-06 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Step: AIC=203.65
factor(heart.disease) ~ age + sex + pain.type + resting.bp +
serum.chol + max.hr + angina + oldpeak + slope + colored +
thal
Df Deviance AIC LRT Pr(>Chi)
- age 1 172.03 202.03 0.3894 0.5326108
- angina 1 173.13 203.13 1.4843 0.2231042
<none> 171.65 203.65
- slope 2 175.99 203.99 4.3442 0.1139366
- max.hr 1 175.00 205.00 3.3560 0.0669599 .
- serum.chol 1 175.11 205.11 3.4610 0.0628319 .
- oldpeak 1 175.42 205.42 3.7710 0.0521485 .
- resting.bp 1 176.61 206.61 4.9639 0.0258824 *
- thal 2 182.91 210.91 11.2633 0.0035826 **
- sex 1 182.77 212.77 11.1221 0.0008531 ***
- pain.type 3 192.83 218.83 21.1859 9.632e-05 ***
- colored 1 194.90 224.90 23.2530 1.420e-06 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Step: AIC=202.03
factor(heart.disease) ~ sex + pain.type + resting.bp + serum.chol +
max.hr + angina + oldpeak + slope + colored + thal
Df Deviance AIC LRT Pr(>Chi)
- angina 1 173.57 201.57 1.5385 0.2148451
<none> 172.03 202.03
- slope 2 176.33 202.33 4.2934 0.1168678
- max.hr 1 175.00 203.00 2.9696 0.0848415 .
- serum.chol 1 175.22 203.22 3.1865 0.0742492 .
- oldpeak 1 175.92 203.92 3.8856 0.0487018 *
- resting.bp 1 176.63 204.63 4.5911 0.0321391 *
- thal 2 183.38 209.38 11.3500 0.0034306 **
- sex 1 183.97 211.97 11.9388 0.0005498 ***
- pain.type 3 193.71 217.71 21.6786 7.609e-05 ***
- colored 1 195.73 223.73 23.6997 1.126e-06 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Step: AIC=201.57
factor(heart.disease) ~ sex + pain.type + resting.bp + serum.chol +
max.hr + oldpeak + slope + colored + thal
Df Deviance AIC LRT Pr(>Chi)
<none> 173.57 201.57
- slope 2 178.44 202.44 4.8672 0.0877201 .
- serum.chol 1 176.83 202.83 3.2557 0.0711768 .
- max.hr 1 177.52 203.52 3.9442 0.0470322 *
- oldpeak 1 177.79 203.79 4.2135 0.0401045 *
- resting.bp 1 178.56 204.56 4.9828 0.0256006 *
- thal 2 186.22 210.22 12.6423 0.0017978 **
- sex 1 185.88 211.88 12.3088 0.0004508 ***
- pain.type 3 200.68 222.68 27.1025 5.603e-06 ***
- colored 1 196.98 222.98 23.4109 1.308e-06 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```

The output is very long. In terms of AIC, which is what `step`

uses, `age`

hangs on for a bit, but eventually gets eliminated.

There are a lot of variables left.

\(\blacksquare\)

- Display the summary of the model that came out of
`step`

.

Solution

This:

`summary(heart.3)`

```
Call:
glm(formula = factor(heart.disease) ~ sex + pain.type + resting.bp +
serum.chol + max.hr + oldpeak + slope + colored + thal, family = "binomial",
data = heart)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -4.818418 2.550437 -1.889 0.058858 .
sexmale 1.850559 0.561583 3.295 0.000983 ***
pain.typeatypical -1.268233 0.604488 -2.098 0.035903 *
pain.typenonanginal -2.086204 0.486591 -4.287 1.81e-05 ***
pain.typetypical -2.532340 0.748941 -3.381 0.000722 ***
resting.bp 0.024125 0.011077 2.178 0.029410 *
serum.chol 0.007142 0.003941 1.812 0.069966 .
max.hr -0.020373 0.010585 -1.925 0.054262 .
oldpeak 0.467028 0.233280 2.002 0.045284 *
slopeflat 0.859564 0.922749 0.932 0.351582
slopeupsloping -0.165832 0.991474 -0.167 0.867167
colored 1.134561 0.261547 4.338 1.44e-05 ***
thalnormal 0.323543 0.813442 0.398 0.690818
thalreversible 1.700314 0.805127 2.112 0.034699 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 370.96 on 269 degrees of freedom
Residual deviance: 173.57 on 256 degrees of freedom
AIC: 201.57
Number of Fisher Scoring iterations: 6
```

Not all of the P-values in the `step`

output wound up being less than 0.05, but they are all at least reasonably small. As discussed above, some of the P-values in the `summary`

are definitely *not* small, but they go with factors where there are significant effects *somewhere*. For example, `thalnormal`

is not significant (that is, `normal`

is not significantly different from the baseline `fixed`

), but the other level `reversible`

*is* different from `fixed`

. You might be wondering about `slope`

: on the `summary`

there is nothing close to significance, but on the `step`

output, `slope`

has at least a reasonably small P-value of 0.088. This is because the significant difference does not involve the baseline: it’s actually between `flat`

with a positive slope and `upsloping`

with a negative one.

\(\blacksquare\)

- We are going to make a large number of predictions. Create and save a data frame that contains predictions for all combinations of representative values for all the variables in the model that came out of
`step`

. By “representative” I mean all the values for a categorical variable, and the five-number summary for a numeric variable. (Note that you will get a*lot*of predictions.)

Solution

The hard work is in listing all the variables. The easiest way to make sure you have them all is to look at the `summary`

of your best model (mine was called `heart.3`

) first, and copy them from the `Call`

at the top. This is easier than looking at the table of Coefficients (or `tidy`

output) because for categorical variables like `pain.type`

you will have to distinguish the name of the variable from its levels. For example, the table of Coefficients has `pain.typeatypical`

and `pain.typenonanginal`

. Is it obvious to you where the variable name ends and its level begins?^{14}

All right, let’s set up our dataframe to predict from. This needs the five-number summary of quantitative variables (via `quantile`

), and the levels of the categorical ones (via `levels(factor())`

). Take a deep breath and begin:

```
<- datagrid(model = heart.3,
new sex = levels(factor(heart$sex)),
pain.type = levels(factor(heart$pain.type)),
resting.bp = quantile(heart$resting.bp),
serum.chol = quantile(heart$serum.chol),
max.hr = quantile(heart$max.hr),
oldpeak = quantile(heart$oldpeak),
slope = levels(factor(heart$slope)),
colored = quantile(heart$colored),
thal = levels(factor(heart$thal)))
new
```