`library(tidyverse)`

# 22 Functions

## 22.1 Making some R functions

Let’s write some simple R functions to convert temperatures, and later to play with text.

A temperature in Celsius is converted to one in Kelvin by adding 273.15. (A temperature of \(-273.15\) Celsius, 0 Kelvin, is the “absolute zero” temperature that nothing can be colder than.) Write a function called

`c_to_k`

that converts an input Celsius temperature to one in Kelvin, and test that it works.Write a function to convert a Fahrenheit temperature to Celsius. The way you do that is to subtract 32 and then multiply by \(5/9\).

*Using the functions you already wrote*, write a function to convert an input Fahrenheit temperature to Kelvin.Rewrite your Fahrenheit-to-Celsius convertor to take a suitable default value and check that it works as a default.

What happens if you feed your Fahrenheit-to-Celsius convertor a

*vector*of Fahrenheit temperatures? What if you use it in a`mutate`

?Write another function called

`wrap`

that takes two arguments: a piece of text called`text`

, which defaults to`hello`

, and another piece of text called`outside`

, which defaults to`*`

. The function returns`text`

with the text`outside`

placed before and after, so that calling the function with the defaults should return`*hello*`

. To do this, you can use`str_c`

from`stringr`

(loaded with the`tidyverse`

) which places its text arguments side by side and glues them together into one piece of text. Test your function briefly.What happens if you want to change the default

`outside`

but use the default for`text`

? How do you make sure that happens? Explore.What happens if you feed your function

`wrap`

a vector for either of its arguments? What about if you use it in a`mutate`

?

## 22.2 The Collatz sequence

The Collatz sequence is a sequence of integers \(x_1, x_2, \ldots\) defined in a deceptively simple way: if \(x_n\) is the current term of the sequence, then \(x_{n+1}\) is defined as \(x_n/2\) if \(x_n\) is even, and \(3x_n+1\) if \(x_n\) is odd. We are interested in understanding how this sequence behaves; for example, what happens to it as \(n\) gets large, for different choices of the first term \(x_1\)? We will explore this numerically with R; the ambitious among you might like to look into the mathematics of it.

What happens to the sequence when it reaches 4? What would be a sensible way of defining where it ends? Explain briefly.

Write an R function called

`is_odd`

that returns`TRUE`

if its input is an odd number and`FALSE`

if it is even (you can assume that the input is an integer and not a decimal number). To do*that*, you can use the function`%%`

where`a %% b`

is the remainder when`a`

is divided by`b`

. To think about oddness or evenness, consider the remainder when you divide by 2.Write an R function called

`hotpo1`

^{1}that takes an integer as input and returns the next number in the Collatz sequence. To do this, use the function you just wrote that determines whether a number is even or odd.Now write a function

`hotpo`

that will return the whole Collatz sequence for an input \(x_1\). For this, assume that you will eventually get to 1.Write two (very small) functions that take an entire sequence as input and return (i) the length of the sequence and (ii) the maximum value it attains.

Make a data frame consisting of the values 11 through 20, and, using

`tidyverse`

ideas, obtain a data frame containing the Collatz sequences starting at each of those values, along with their lengths and their maximum values. Which sequence is longest? Which one goes up highest?

## 22.3 Coefficient of Variation

The coefficient of variation of a vector `x`

is defined as the standard deviation of `x`

divided by the mean of `x`

.

Write a function called

`cv`

that calculates the coefficient of variation of its input and returns the result. You should use base R’s functions that reliably compute the pieces that you need.Use your function to find the coefficient of variation of the set of integers 1 through 5.

Define a vector as follows:

`<- c(-2.8, -1.8, -0.8, 1.2, 4.2) v `

What is its coefficient of variation, according to your function? Does this make sense? Why did this happen? Explain briefly.

- Most people only calculate a coefficient of variation if there are no negative numbers. Rewrite your function so that it gives an error if there are any negative numbers in the input, and test it with the vector
`v`

above. Hint: you might need to add`#|error=TRUE`

as the first line of your code chunk, to allow your document to render properly.

## 22.4 Rescaling

Suppose we have this vector of values:

```
<- c(10, 14, 11)
z z
```

`[1] 10 14 11`

We want to scale these so that the smallest value is 0 and the largest is 1. We are going to be doing this a lot, so we are going to write a function that will work for any input.

Using a copy of my

`z`

, work out`min(z)`

and`max(z)`

. What do they do? Explain (very) briefly.What do these lines of code do, using the same

`z`

that I had? Run them and see, and describe briefly what`s`

contains.

```
<- min(z)
lo <- max(z)
hi <- (z - lo) / (hi - lo)
s s
```

Write a function called

`rescale`

that implements the calculation above, for any input vector called`x`

. (Note that I changed the name.)Test your function on my

`z`

, and on another vector of your choosing. Explain briefly why the answer you get from your vector makes sense.What happens if your input to

`rescale`

is a vector of numbers all the same? Give an example. Rewrite your function to intercept this case and give a helpful error message.Make a dataframe (containing any numeric values), and in it create a new column containing the rescaled version of one of its columns, using your function. Show your result.

We might want to rescale the input not to be between 0 and 1, but between two values

`a`

and`b`

that we specify as input. If`a`

and/or`b`

are not given, we want to use the values 0 for`a`

and 1 for`b`

. Rewrite your function to rescale the input to be between`a`

and`b`

instead of 0 and 1. Hint: allow your function to produce values between 0 and 1 as before, and then note that if all the values in a vector`s`

are between 0 and 1, then all the values in`a+(b-a)*s`

are between \(a\) and \(b\).Test your new function two or more times, on input where you know or can guess what the output is going to be. In each case, explain briefly why your output makes sense.

My solutions follow:

## 22.5 Making some R functions

Let’s write some simple R functions to convert temperatures, and later to play with text.

- A temperature in Celsius is converted to one in Kelvin by adding 273.15. (A temperature of \(-273.15\) Celsius, 0 Kelvin, is the “absolute zero” temperature that nothing can be colder than.) Write a function called
`c_to_k`

that converts an input Celsius temperature to one in Kelvin, and test that it works.

Solution

This is mostly an exercise in structuring your function correctly. Let’s call the input `C`

(uppercase C, since lowercase c has a special meaning to R):

```
<- function(C) {
c_to_k + 273.15
C
}c_to_k(0)
```

`[1] 273.15`

`c_to_k(20)`

`[1] 293.15`

This is the simplest way to do it: the last line of the function, if calculated but not saved, is the value that gets returned to the outside world. The checks suggest that it worked.

If you’re used to Python or similar, you might prefer to calculate the value to be returned and then return it. You can do that in R too:

```
<- function(C) {
c_to_k <- C + 273.15
K return(K)
}c_to_k(0)
```

`[1] 273.15`

`c_to_k(20)`

`[1] 293.15`

That works just as well, and for the rest of this question, you can go either way.^{2}

\(\blacksquare\)

- Write a function to convert a Fahrenheit temperature to Celsius. The way you do that is to subtract 32 and then multiply by \(5/9\).

Solution

On the model of the previous one, we should call this `f_to_c`

. I’m going to return the last line, but you can save the calculated value and return that instead:

```
<- function(F) {
f_to_c - 32) * 5 / 9
(F
}f_to_c(32)
```

`[1] 0`

`f_to_c(50)`

`[1] 10`

`f_to_c(68)`

`[1] 20`

Americans are very good at saying things like “temperatures in the 50s”, which don’t mean much to me, so I like to have benchmarks to work with: these are the Fahrenheit versions of 0, 10, and 20 Celsius.

Thus “in the 50s” means “between about 10 and 15 Celsius”.

\(\blacksquare\)

*Using the functions you already wrote*, write a function to convert an input Fahrenheit temperature to Kelvin.

Solution

This implies that you can piggy-back on the functions you just wrote, which goes as below. First you convert the Fahrenheit to Celsius, and then you convert *that* to Kelvin. (This is less error-prone than trying to use algebra to get a formula for this conversion and then implementing that):

```
<- function(F) {
f_to_k <- f_to_c(F)
C <- c_to_k(C)
K return(K)
}f_to_k(32)
```

`[1] 273.15`

`f_to_k(68)`

`[1] 293.15`

These check because in Celsius they are 0 and 20 and we found the Kelvin equivalents of those to be these values earlier.

I wrote this one with a `return`

because I thought it made the structure clearer: run one function, save the result, run another function, save the result, then return what you’ve got.

\(\blacksquare\)

- Rewrite your Fahrenheit-to-Celsius convertor to take a suitable default value and check that it works as a default.

Solution

You can choose any default you like. I’ll take a default of 68 (what I would call “a nice day”):

```
<- function(F = 68) {
f_to_c - 32) * 5 / 9
(F
}f_to_c(68)
```

`[1] 20`

`f_to_c()`

`[1] 20`

The change is in the top line of the function. You see the result: if we run it without an input, we get the same answer as if the input had been 68.

\(\blacksquare\)

- What happens if you feed your Fahrenheit-to-Celsius convertor a
*vector*of Fahrenheit temperatures? What if you use it in a`mutate`

?

Solution

Try it and see:

```
<- seq(30, 80, 10)
temps temps
```

`[1] 30 40 50 60 70 80`

`f_to_c(temps)`

`[1] -1.111111 4.444444 10.000000 15.555556 21.111111 26.666667`

Each of the Fahrenheit temperatures gets converted into a Celsius one. This is perhaps more useful in a data frame, thus:

```
tibble(temps = seq(30, 80, 10)) %>%
mutate(celsius = f_to_c(temps))
```

All the temperatures are side-by-side with their equivalents.

Here’s another way to do the above:

```
<- seq(30, 80, 10)
temps %>%
temps enframe(value = "fahrenheit") %>%
mutate(celsius = f_to_c(temps))
```

`enframe`

creates a two-column data frame out of a vector (like `temps`

). A vector can have “names”, in which case they’ll be used as the `name`

column; the values will go in a column called `value`

unless you rename it, as I did.

\(\blacksquare\)

- Write another function called
`wrap`

that takes two arguments: a piece of text called`text`

, which defaults to`hello`

, and another piece of text called`outside`

, which defaults to`*`

. The function returns`text`

with the text`outside`

placed before and after, so that calling the function with the defaults should return`*hello*`

. To do this, you can use`str_c`

from`stringr`

(loaded with the`tidyverse`

) which places its text arguments side by side and glues them together into one piece of text. Test your function briefly.

Solution

This:

```
<- function(text = "hello", outside = "*") {
wrap str_c(outside, text, outside)
}
```

I can run this with the defaults:

`wrap()`

`[1] "*hello*"`

or with text of my choosing:

`wrap("goodbye", "_")`

`[1] "_goodbye_"`

I think that’s what I meant by “test briefly”.

\(\blacksquare\)

- What happens if you want to change the default
`outside`

but use the default for`text`

? How do you make sure that happens? Explore.

Solution

The obvious thing is this, which doesn’t work:

`wrap("!")`

`[1] "*!*"`

This takes * text* to be

`!`

, and `outside`

to be the default. How do we get `outside`

to be `!`

instead? The key is to specify the input by name:`wrap(outside = "!")`

`[1] "!hello!"`

This correctly uses the default for `text`

.

If you specify inputs without names, they are taken to be in the order that they appear in the function definition. As soon as they get out of order, which typically happens by using the default for something early in the list, as we did here for `text`

, you have to specify names for anything that comes after that. These are the names you put on the function’s top line.

You can always use names:

`wrap(text = "thing", outside = "**")`

`[1] "**thing**"`

and if you use names, they don’t even have to be in order:

`wrap(outside = "!?", text = "fred")`

`[1] "!?fred!?"`

\(\blacksquare\)

- What happens if you feed your function
`wrap`

a vector for either of its arguments? What about if you use it in a`mutate`

?

Solution

Let’s try:

```
<- c("a", "b", "c")
mytext wrap(text = mytext)
```

`[1] "*a*" "*b*" "*c*"`

```
<- c("*", "!")
myout wrap(outside = myout)
```

`[1] "*hello*" "!hello!"`

If one of the inputs is a vector, the other one gets “recycled” as many times as the vector is long. What if they’re both vectors?

```
<- c("a", "b", "c", "d")
mytext2 wrap(mytext2, myout)
```

```
Error in `str_c()`:
! Can't recycle `..1` (size 2) to match `..2` (size 4).
```

This gives an error because `str_c`

won’t let you recycle things that are both vectors.

The `mutate`

thing is easier, because all the columns in a data frame have to be the same length. `LETTERS`

is a vector with the uppercase letters in it:

```
tibble(mytext = LETTERS[1:6], myout = c("*", "**", "!", "!!", "_", "__")) %>%
mutate(newthing = wrap(mytext, myout))
```

\(\blacksquare\)

## 22.6 The Collatz sequence

The Collatz sequence is a sequence of integers \(x_1, x_2, \ldots\) defined in a deceptively simple way: if \(x_n\) is the current term of the sequence, then \(x_{n+1}\) is defined as \(x_n/2\) if \(x_n\) is even, and \(3x_n+1\) if \(x_n\) is odd. We are interested in understanding how this sequence behaves; for example, what happens to it as \(n\) gets large, for different choices of the first term \(x_1\)? We will explore this numerically with R; the ambitious among you might like to look into the mathematics of it.

- What happens to the sequence when it reaches 4? What would be a sensible way of defining where it ends? Explain briefly.

Solution

When the sequence reaches 4 (that is, when its current term is 4), the next term is 2 and the one after that is 1. Then the following term is 4 again (\((3 \times 1)+1\)) and then it repeats indefinitely, \(4, 2, 1, 4, 2, 1, \ldots\). I think a sensible way to define where the sequence ends is to say “when it reaches 1”, since if you start at 2 you’ll never reach 4 (so “when it reaches 4” won’t work), and it seems at least plausible that it will hit the cycle 4, 2, 1 sometime.

\(\blacksquare\)

- Write an R function called
`is_odd`

that returns`TRUE`

if its input is an odd number and`FALSE`

if it is even (you can assume that the input is an integer and not a decimal number). To do*that*, you can use the function`%%`

where`a %% b`

is the remainder when`a`

is divided by`b`

. To think about oddness or evenness, consider the remainder when you divide by 2.

Solution

Let’s try this out. For example, 5 is odd and 6 is even, so

`5 %% 2`

`[1] 1`

`6 %% 2`

`[1] 0`

When a number is odd, its remainder on dividing by 2 is 1, and when even, the remainder is 0. There is an additional shortcut here in that 1 is the numeric value of `TRUE`

and 0 of `FALSE`

, so all we have to do is calculate the remainder on dividing by 2, turn it into a `logical`

, and return it:

```
<- function(x) {
is_odd <- x %% 2
r as.logical(r)
}
```

You probably haven’t seen `as.logical`

before, but it’s the same idea as `as.numeric`

: turn something that looks like a `TRUE`

or `FALSE`

into something that actually *is*.

We should test it:

`is_odd(19)`

`[1] TRUE`

`is_odd(12)`

`[1] FALSE`

`is_odd(0)`

`[1] FALSE`

0 is usually considered an even number, so this is good.

\(\blacksquare\)

- Write an R function called
`hotpo1`

^{3}that takes an integer as input and returns the next number in the Collatz sequence. To do this, use the function you just wrote that determines whether a number is even or odd.

Solution

The logic is “if the input is odd, return 3 times it plus 1, otherwise return half of it”. The R structure is an `if-then-else`

:

```
<- function(x) {
hotpo1 if (is_odd(x)) 3 * x + 1 else x / 2
}
```

In R, the condition that is tested goes in brackets, and then if the value-if-true and the value-if-false are single statements, you just type them. (If they are more complicated than that, you put them in curly brackets.) Now you see the value of writing `is_odd`

earlier; this code almost looks like the English-language description of the sequence. If we had not written `is_odd`

before, the condition would have looked something like

`if (x %% 2 == 1) 3 * x + 1 else x / 2`

which would have been a lot harder to read.

All right, let’s try that out:

`hotpo1(4)`

`[1] 2`

`hotpo1(7)`

`[1] 22`

`hotpo1(24)`

`[1] 12`

That looks all right so far.

\(\blacksquare\)

- Now write a function
`hotpo`

that will return the whole Collatz sequence for an input \(x_1\). For this, assume that you will eventually get to 1.

Solution

This is a loop, but not a `for`

loop (or something that we could do `rowwise`

), because we don’t know how many times we have to go around. This is the kind of thing that we should use a `while`

loop for: “keep going while a condition is true”. In this case, we should keep going if we haven’t reached 1 yet. If we haven’t reached 1, we should generate the next value of the sequence and glue it onto what we have so far. To initialize the sequence, we start with the input value. There is an R trick to glue a value onto the sequence, which is to use `c`

with a vector and a value, and save it back into the vector:

```
<- function(x) {
hotpo <- x
sequence <- x
term while (term > 1) {
<- hotpo1(term)
term <- c(sequence, term)
sequence
}
sequence }
```

I use `term`

to hold the current term of the sequence, and overwrite it with the next one (since I don’t need the old one any more).

Does it work?

`hotpo(4)`

`[1] 4 2 1`

`hotpo(12)`

` [1] 12 6 3 10 5 16 8 4 2 1`

`hotpo(97)`

```
[1] 97 292 146 73 220 110 55 166 83 250 125 376 188 94 47
[16] 142 71 214 107 322 161 484 242 121 364 182 91 274 137 412
[31] 206 103 310 155 466 233 700 350 175 526 263 790 395 1186 593
[46] 1780 890 445 1336 668 334 167 502 251 754 377 1132 566 283 850
[61] 425 1276 638 319 958 479 1438 719 2158 1079 3238 1619 4858 2429 7288
[76] 3644 1822 911 2734 1367 4102 2051 6154 3077 9232 4616 2308 1154 577 1732
[91] 866 433 1300 650 325 976 488 244 122 61 184 92 46 23 70
[106] 35 106 53 160 80 40 20 10 5 16 8 4 2 1
```

97 is a wild ride, but it does eventually get to 1.

Extra: where I originally saw this, which was “Metamagical Themas” by Douglas Hofstadter, he was illustrating the programming language Lisp and the process of recursion, whereby you define a function in terms of itself. This one is a natural for that, because the Collatz sequence starting at \(x\) is \(x\) along with the Collatz sequence starting at the next term. For example, if you start at 12, the next term is 6, so that the Collatz sequence starting at 12 is 12 followed by the Collatz sequence starting at 6. There is no dependence any further back. You can do recursion in R also; there is no problem with a function calling itself:

```
<- function(x) {
hotpo_rec if (x == 1) 1 else c(x, hotpo_rec(hotpo1(x)))
}
```

Recursive functions have two parts: a “base case” that says how you know you are done (the 1 here), and a “recursion” that says how you move to a simpler case, here working out the next term, getting the whole sequence for that, and gluing the input onto the front. It seems paradoxical that you define a function in terms of itself, but what you are doing is calling a simpler sequence, in this case one that is length one shorter than the sequence for the original input. Thus, we hope,^{4} we will eventually reach 1.

Does it work?

`hotpo_rec(12)`

` [1] 12 6 3 10 5 16 8 4 2 1`

`hotpo_rec(97)`

```
[1] 97 292 146 73 220 110 55 166 83 250 125 376 188 94 47
[16] 142 71 214 107 322 161 484 242 121 364 182 91 274 137 412
[31] 206 103 310 155 466 233 700 350 175 526 263 790 395 1186 593
[46] 1780 890 445 1336 668 334 167 502 251 754 377 1132 566 283 850
[61] 425 1276 638 319 958 479 1438 719 2158 1079 3238 1619 4858 2429 7288
[76] 3644 1822 911 2734 1367 4102 2051 6154 3077 9232 4616 2308 1154 577 1732
[91] 866 433 1300 650 325 976 488 244 122 61 184 92 46 23 70
[106] 35 106 53 160 80 40 20 10 5 16 8 4 2 1
```

It does.

Recursive functions are often simple to understand, but they are not always very efficient. They can take a lot of memory, because they have to handle the intermediate calls to the function, which they have to save to use later (in the case of `hotpo_rec(97)`

there are a lot of those). Recursive functions are often paired with a technique called “memoization”, where each time you calculate the function’s value, you *save* it in another array. The first thing you do in the recursive function is to check whether you already have the answer, in which case you just look it up and return it. It was a lot of work here to calculate the sequence from 97, but if we had saved the results, we would already have the answers for 292, 146, 73, 220 and so on, and getting those later would be a table lookup rather than another recursive calculation.

\(\blacksquare\)

- Write two (very small) functions that take an entire sequence as input and return (i) the length of the sequence and (ii) the maximum value it attains.

Solution

These are both one-liners. Call the input whatever you like:

```
<- function(sequence) length(sequence)
hotpo_len <- function(sequence) max(sequence) hotpo_max
```

Because they are one-liners, you don’t even need the curly brackets, although there’s no problem if they are there.

Testing:

`hotpo_len(hotpo(12))`

`[1] 10`

`hotpo_max(hotpo(97))`

`[1] 9232`

This checks with what we had before.

\(\blacksquare\)

- Make a data frame consisting of the values 11 through 20, and, using
`tidyverse`

ideas, obtain a data frame containing the Collatz sequences starting at each of those values, along with their lengths and their maximum values. Which sequence is longest? Which one goes up highest?

Solution

This one uses `rowwise`

ideas:^{5}

```
tibble(x = 11:20) %>%
%>%
rowwise mutate(sequence = list(hotpo(x))) %>%
mutate(length = hotpo_len(sequence)) %>%
mutate(high = hotpo_max(sequence))
```

First, we obtain a list-column containing the sequences (which is why its calculation needs a `list`

around it), then two ordinary columns of their lengths and their maximum values.

The sequences for 18 and 19 are the longest, but the sequence for 15 goes up the highest.

\(\blacksquare\)

## 22.7 Coefficient of Variation

The coefficient of variation of a vector `x`

is defined as the standard deviation of `x`

divided by the mean of `x`

.

- Write a function called
`cv`

that calculates the coefficient of variation of its input and returns the result. You should use base R’s functions that reliably compute the pieces that you need.

Solution

I like to make a function skeleton first to be sure I remember all the pieces. You can do this by making a code chunk, typing `fun`

, waiting a moment, and selecting the “snippet” that is offered to you. That gives you this, with your cursor on `name`

:

```
<- function(variables) {
name
}
```

Give your function a name (I asked you to use `cv`

), hit tab, enter a name like x for the input, hit tab again, and then fill in the body of the function, to end up with something like this:

```
<- function(x) {
cv <- mean(x)
mean_x <- sd(x)
sd_x /mean_x
sd_x }
```

I think this is best, for reasons discussed below.

Some observations:

- in the last line of the function, you calculate something, and that something is the thing that gets returned. You
*do not*need to save it in a variable, because then you have to return that variable by having its name alone on the last line. - R has a
`return()`

function, but it is*bad style*to use it to return a value at the end of a function. The time to use it is when you test something early in the function and want to return early with a value like zero or missing without doing any actual calculation. Looking ahead a bit, you might want to return “missing” if the mean is zero*before*dividing by it, which you could do like this, this being*good*style:

```
<- function(x) {
cv2 <- mean(x)
mean_x if (mean_x == 0) return(NA)
<- sd(x)
sd_x /mean_x
sd_x }
```

use the built-in

`mean`

and`sd`

rather than trying to roll your own, because they have been tested by many users over a long time and they work. That’s what I meant by “pieces” in the question.you might be tempted to do something like this:

```
<- function(x) {
cv3 sd(x)/mean(x)
}
```

This will work, but it is not a good habit to get into, and thus not best as an answer to this question. This one line of code says three things: “work out the SD of `x`

”, “work out the mean of `x`

”, and “divide them by each other”. It is much clearer, for anyone else reading the code (including yourself in six months when you have forgotten what you were thinking) to have the three things one per line, so that anyone reading the code sees that each line does one thing and what that one thing is. There is a reason why production code and code golf are two very different things. Code that is easy to read is also easy to maintain, either by you or others.

- using something other than
`mean`

and`sd`

as the names of your intermediate results is a good idea because these names are already used by R (as the names of the functions that compute the mean and SD). Redefining names that R uses can make other code behave unpredictably and cause hard-to-find bugs.^{6}

\(\blacksquare\)

- Use your function to find the coefficient of variation of the set of integers 1 through 5.

Solution

This can be as simple as

`cv(1:5)`

`[1] 0.5270463`

or, equally good, define this into a vector first:

```
<- 1:5
y cv(y)
```

`[1] 0.5270463`

or, displaying a little less understanding, type the five numbers into the vector first (or directly as input to the function):

```
<- c(1, 2, 3, 4, 5)
y cv(y)
```

`[1] 0.5270463`

`cv(c(1, 2, 3, 4, 5))`

`[1] 0.5270463`

\(\blacksquare\)

- Define a vector as follows:

`<- c(-2.8, -1.8, -0.8, 1.2, 4.2) v `

What is its coefficient of variation, according to your function? Does this make sense? Why did this happen? Explain briefly.

Solution

Try it and see:

`cv(v)`

`[1] 6.248491e+16`

A very large number, much bigger than any of the data values; with these human-sized numbers, we’d expect a human-sized coefficient of variation as well.

What actually happened was that the mean of `v`

is this:

`mean(v)`

`[1] 4.440892e-17`

Zero, or close enough, so that in calculating the coefficient of variation, we divided by something that was (almost) zero and got a result that was (almost) infinite.^{7}

The possibility of getting a zero mean is why most people only calculate a coefficient of variation if all of the numbers are positive, which brings us to the next part:

\(\blacksquare\)

- Most people only calculate a coefficient of variation if there are no negative numbers. Rewrite your function so that it gives an error if there are any negative numbers in the input, and test it with the vector
`v`

above. Hint: you might need to add`#|error=TRUE`

as the first line of your code chunk, to allow your document to render properly.

Solution

This is a case for `stopifnot`

, or of `if`

coupled with `stop`

. Check this up front, as the first thing you do before you calculate anything else. As to what to check, there are several possibilities:

- stop if any of the numbers are negative
- continue if all of the numbers are positive
- stop if the
*smallest*number is negative (the smallest number is negative if and only if not all the numbers are positive)

R has functions `any`

and `all`

that do what you’d expect:

```
<- 1:5
w w
```

`[1] 1 2 3 4 5`

`any(w>3.5)`

`[1] TRUE`

`all(w<4.5)`

`[1] FALSE`

Are there any numbers greater than 3.5 (yes, 4 and 5); are all the numbers less than 4.5 (no, 5 isn’t).

Cite your sources for these if you use either of them, since this is the first place in the course that I’m mentioning either of them.

Remember that if you use `stopifnot`

, the condition that goes in there is what has to be true if the function is to run; if you use `if`

and `stop`

, the condition is what will *stop* the function running. With that in mind, I would code my three possibilities above this way. First off, here’s the original:

```
<- function(x) {
cv <- mean(x)
mean_x <- sd(x)
sd_x /mean_x
sd_x }
```

then, stop if any of the numbers are negative:

```
<- function(x) {
cv if (any(x<0)) stop("A value is negative")
<- mean(x)
mean_x <- sd(x)
sd_x /mean_x
sd_x
}cv(v)
```

`Error in cv(v): A value is negative`

continue if all the numbers are positive

```
<- function(x) {
cv stopifnot(all(x>0))
<- mean(x)
mean_x <- sd(x)
sd_x /mean_x
sd_x
}cv(v)
```

`Error in cv(v): all(x > 0) is not TRUE`

stop if the smallest value is negative

```
<- function(x) {
cv if (min(x)<0) stop("Smallest value is negative")
<- mean(x)
mean_x <- sd(x)
sd_x /mean_x
sd_x
}cv(v)
```

`Error in cv(v): Smallest value is negative`

There are (at least) three other possibilities: you can negate the logical condition and interchange `if/stop`

and `stopifnot`

, thus (at the expense of some clarity of reading):

continue if it is not true that any of the numbers are negative

```
<- function(x) {
cv stopifnot(!any(x<0))
<- mean(x)
mean_x <- sd(x)
sd_x /mean_x
sd_x
}cv(v)
```

`Error in cv(v): !any(x < 0) is not TRUE`

(you might be thinking of De Morgan’s laws here)

stop if it is not true that all the numbers are positive

```
<- function(x) {
cv if (!all(x>0)) stop("Not all values are positive")
<- mean(x)
mean_x <- sd(x)
sd_x /mean_x
sd_x
}cv(v)
```

`Error in cv(v): Not all values are positive`

continue if the smallest value is not negative

```
<- function(x) {
cv stopifnot(min(x)>=0)
<- mean(x)
mean_x <- sd(x)
sd_x /mean_x
sd_x
}cv(v)
```

`Error in cv(v): min(x) >= 0 is not TRUE`

or another way to do the last one, a more direct negation of the condition, which at my guess needs some extra brackets:

```
<- function(x) {
cv stopifnot(!(min(x)<0))
<- mean(x)
mean_x <- sd(x)
sd_x /mean_x
sd_x
}cv(v)
```

`Error in cv(v): !(min(x) < 0) is not TRUE`

This one is hard to parse: what does that last message mean? I would take a negative off each side and read it as “min of x is negative is TRUE”, but that takes extra effort.

I said that last one needed some extra brackets. This is, I thought, to get the order of operations right (operator precedence); it turns out not to matter because “not” has lower precedence than most other things, so that these do actually work (the “not” is evaluated *after* the less-than and the other things, so last of all here, even though it appears to be “glued” to the `min`

):

`!min(v)<0`

`[1] FALSE`

`!min(1:5)<0`

`[1] TRUE`

See this for details. See especially the second set of examples, the ones beginning with “Special operators”, and see especially-especially the comment at the bottom of these examples! That is to say, *you* should put in the extra brackets unless you also make the case that they are not needed, because anyone reading your code is guaranteed to be confused by it when they read it (including you in six months, because you *will not* remember the operator priority of “not”).

My take is that one of the first three of the seven possibilities for coding `stopifnot`

or `if`

with `stop`

is the best, since these more obviously encode the condition for continuing or stopping as appropriate. There are two things here: one is that you have to get the code right, but the second is that you have to get the code *clear*, so that it is obvious to anyone reading it that it does the right thing (once again, this includes you in six months). On that score, the first three alternatives are a direct expression of what you want to achieve, and the last four make it look as if you found a way of coding it that worked and stopped there, without thinking about whether there were any other, clearer or more expressive, possibilities.

\(\blacksquare\)

## 22.8 Rescaling

Suppose we have this vector of values:

```
<- c(10, 14, 11)
z z
```

`[1] 10 14 11`

We want to scale these so that the smallest value is 0 and the largest is 1. We are going to be doing this a lot, so we are going to write a function that will work for any input.

- Using a copy of my
`z`

, work out`min(z)`

and`max(z)`

. What do they do? Explain (very) briefly.

Solution

Simply this – define `z`

first:

```
<- c(10, 14, 11)
z min(z)
```

`[1] 10`

`max(z)`

`[1] 14`

They are respectively (and as you would guess) the smallest and largest of the values in `z`

. (A nice gentle warmup, but I wanted to get you on the right track for what is coming up.)

\(\blacksquare\)

- What do these lines of code do, using the same
`z`

that I had? Run them and see, and describe briefly what`s`

contains.

```
<- min(z)
lo <- max(z)
hi <- (z - lo) / (hi - lo)
s s
```

Solution

Here we go:

```
<- min(z)
lo <- max(z)
hi <- (z - lo) / (hi - lo)
s s
```

`[1] 0.00 1.00 0.25`

The lowest value became 0, the highest 1, and the other one to something in between. Saying this shows the greatest insight.

Extra: the reason for doing it in three steps rather than one (see below) is (i) it makes it clearer what is going on (and thus makes it less likely that you make a mistake), and (ii) it is more efficient, since my way only finds the minimum once instead of twice. Compare this approach with mine above:

`- min(z)) / (max(z) - min(z)) (z `

`[1] 0.00 1.00 0.25`

More complicated with all the brackets, and two `min(z)`

. Admittedly the difference here will be thousandths of a second, but why call a function twice when you don’t have to?

\(\blacksquare\)

- Write a function called
`rescale`

that implements the calculation above, for any input vector called`x`

. (Note that I changed the name.)

Solution

Write a function skeleton:

```
<- function(x) {
rescale
}
```

and inside the curly brackets put the code from above, replacing `z`

with `x`

everywhere:

```
<- function(x) {
rescale <- min(x)
lo <- max(x)
hi - lo) / (hi - lo)
(x }
```

You’ll need to make sure your function returns something to the outside world. Either don’t save the last line in `s`

(as I did here), or save it in `s`

and then return `s`

:

```
<- function(x) {
rescale <- min(x)
lo <- max(x)
hi <- (x - lo) / (hi - lo)
s
s }
```

or use `return()`

if you must, but be aware that this is bad style in R (unlike Python, where you need it). The approved way of using `return`

in R is when you are returning something *earlier* than the last line of a function, for example, you are testing a simple case first and returning the value that goes with that, before getting down to the serious computation.

Extra: in the spirit of what’s coming up below, you might check first whether the maximum and minimum are the same and return something else if that’s the case:

```
<- function(x) {
rescale0 <- min(x)
lo <- max(x)
hi if (lo == hi) return(0)
<- (x - lo) / (hi - lo)
s
s }
```

This is good style; in this case, if `lo`

and `hi`

are the same, we want to return something else (zero) to the outside world, and *then* we do the calculation, knowing that `lo`

and `hi`

are different, so that we are sure we are not dividing by zero (but I get ahead of myself).

Doing it this way, there is something to be careful of: a function ought to return a predictable *type* of thing: numbers, in this case. If you have your function return *text* on error, like this:

```
<- function(x) {
rescale0 <- min(x)
lo <- max(x)
hi if (lo == hi) return("High and low need to be different")
<- (x - lo) / (hi - lo)
s
s }
```

then you can get into trouble:

`rescale0(z)`

`[1] 0.00 1.00 0.25`

`rescale0(c(3,3,3))`

`[1] "High and low need to be different"`

The first one is numbers and the second one is text.

\(\blacksquare\)

- Test your function on my
`z`

, and on another vector of your choosing. Explain briefly why the answer you get from your vector makes sense.

Solution

On my `z`

:

`rescale(z)`

`[1] 0.00 1.00 0.25`

The same values as I got before, so that works.

For your vector, use whatever you like. I think it makes sense to have the values already in order, to make it easier to check. Here’s one possibility:

```
<- 2:6
w w
```

`[1] 2 3 4 5 6`

and then

`rescale(w)`

`[1] 0.00 0.25 0.50 0.75 1.00`

The smallest value is 2, which goes to zero; the largest is 6, which goes to 1, and the others are equally spaced between in both the input and the output.

Another possibility is to use a vector with values whose largest and smallest you can clearly see:

```
<- c(10, 11, 100, 0, 20)
w rescale(w)
```

`[1] 0.10 0.11 1.00 0.00 0.20`

Clearly the smallest value is 0 and the largest is 100. These become 0 and 1, and these particular values make it easy to see what happened: each of the other values got divided by 100.

Some discussion is needed here, in that you need to say something convincing about why your answer is right.

Extra: This is why I had you use a name *other than* `z`

for the input to your function. The function can be used on any input, not just the `z`

that we tested it on. There’s another R-specific reason, which is that you need to be careful about using the named inputs *only*. Consider this function:

```
<- function(x) {
ff + z
x
}
ff(10)
```

`[1] 20 24 21`

Where did `z`

come from? R *used the z we had before*, which is rather dangerous: what if we had a

`z`

lying around from some completely different work? Much better to have a function work with only inputs in the top line:```
<- function(x, z) {
ff + z
x
}ff(10, 3)
```

`[1] 13`

`ff(10)`

`Error in ff(10): argument "z" is missing, with no default`

The first time, the two inputs are added together, but the second time it tells you it was expecting a value to use for `z`

and didn’t see one. Much safer.

\(\blacksquare\)

- What happens if your input to
`rescale`

is a vector of numbers all the same? Give an example. Rewrite your function to intercept this case and give a helpful error message.

Solution

First, try it and see. Any collection of values all the same will do:

`rescale(c(3,3,3))`

`[1] NaN NaN NaN`

`NaN`

stands for “not a number”. The way we got it is that the minimum and maximum were the same, so our function ended up dividing by zero (in fact, working out zero divided by zero). This is, in R terms, not even an error, but the answer is certainly not helpful.

The easiest way to check inputs is to use `stopifnot`

to express what should be *true* if the function is to proceed. Here, we want the maximum and minimum to be different, so:

```
<- function(x) {
rescale <- min(x)
lo <- max(x)
hi stopifnot(hi != lo)
- lo) / (hi - lo)
(x
}rescale(c(3,3,3))
```

`Error in rescale(c(3, 3, 3)): hi != lo is not TRUE`

This is much clearer: I only have to recall what my `hi`

and `lo`

are to see what the problem is.

Extra 1: by calculating and saving the min and max up front, I still only need to calculate them once. If you do it this way:

```
<- function(x) {
rescale stopifnot(max(x) != min(x))
- min(x)) / (max(x) - min(x))
(x
}rescale(c(3,3,3))
```

`Error in rescale(c(3, 3, 3)): max(x) != min(x) is not TRUE`

you get a slightly more informative error message, but you have calculated the max *twice* and the min *three times* for no reason.

Extra 2: `stopifnot`

is shorthand for this:

```
<- function(x) {
rescale <- min(x)
lo <- max(x)
hi if (hi == lo) stop("min and max are the same!")
- lo) / (hi - lo)
(x
}rescale(c(3,3,3))
```

`Error in rescale(c(3, 3, 3)): min and max are the same!`

I didn’t show you this, so if you use `stop`

, you must tell me where you found out about it. This is better than returning some text (see `rescale0`

above) or printing a message: it’s an error, so you want to make it look like an error. I am very sympathetic to being persuaded that this is *better* than `stopifnot`

, because you can customize the message (and, also, you don’t have to go through the double-negative contortions of `stopifnot`

). Another way to use `stopifnot`

and get a customized message is this one (that I only learned about right when you were writing this Assignment):

```
<- function(x) {
rescale <- min(x)
lo <- max(x)
hi stopifnot("high and low must be different" = (hi != lo))
- lo) / (hi - lo)
(x
}rescale(c(3,3,3))
```

`Error in rescale(c(3, 3, 3)): high and low must be different`

This is called a “named argument”, and the name, if given, is used as an error message.

Extra 3: returning to my `rescale0`

from above:

` rescale0`

```
function(x) {
lo <- min(x)
hi <- max(x)
if (lo == hi) return("High and low need to be different")
s <- (x - lo) / (hi - lo)
s
}
<bytecode: 0x55b5817e0898>
```

this can get you into trouble if you use it in a dataframe. This is a bit complicated, since it has to use list-columns. Here we go:

`tibble(x = list(z, c(3,3,3)))`

Just to check that this does contain what you think it does:

`tibble(x = list(z, c(3,3,3))) %>% unnest(x)`

So now, for each of those two input vectors, what happens when we run `rescale0`

on them? This is `rowwise`

:

```
tibble(x = list(z, c(3,3,3))) %>%
rowwise() %>%
mutate(ans = list(rescale0(x)))
```

The first `ans`

is a vector of 3 numbers, and the second one is one piece of text (the “error message”). I was actually surprised it got this far. So what happens when we `unnest`

the second column?

```
tibble(x = list(z, c(3,3,3))) %>%
rowwise() %>%
mutate(ans = list(rescale0(x))) %>%
unnest(ans)
```

```
Error in `list_unchop()`:
! Can't combine `x[[1]]` <double> and `x[[2]]` <character>.
```

Now we get a confusing error: it’s *here* that combining some numbers and some text in one column of a dataframe doesn’t work. To forestall this, we need to go back and rewrite `rescale0`

to not mix things up. Having it return an error, as the latest version of `rescale`

does, gives an error here too, but at least we know what it means:

```
tibble(x = list(z, c(3,3,3))) %>%
rowwise() %>%
mutate(ans = list(rescale(x))) %>%
unnest(ans)
```

```
Error in `mutate()`:
ℹ In argument: `ans = list(rescale(x))`.
ℹ In row 2.
Caused by error in `rescale()`:
! high and low must be different
```

because this is the error we anticipated: it says “somewhere within the list-column `x`

, specifically in its second row, is a vector where everything is the same”.

\(\blacksquare\)

- Make a dataframe (containing any numeric values), and in it create a new column containing the rescaled version of one of its columns, using your function. Show your result.

Solution

This is less difficult than you might be expecting: make a dataframe with at least one numeric column, and use `mutate`

:

```
<- tibble(y=2:6)
d d
```

and then

`%>% mutate(s=rescale(y)) d `

You can supply the values for what I called `y`

, or use random numbers. It’s easier for you to check that it has worked if your column playing the role of my `y`

has not too many values in it.

Extra: this is actually already in the `tidyverse`

under the name `percent_rank`

(“percentile ranks”):

`%>% mutate(s = percent_rank(y)) d `

The value 5, for example, is at the 75th percentile.

\(\blacksquare\)

- We might want to rescale the input not to be between 0 and 1, but between two values
`a`

and`b`

that we specify as input. If`a`

and/or`b`

are not given, we want to use the values 0 for`a`

and 1 for`b`

. Rewrite your function to rescale the input to be between`a`

and`b`

instead of 0 and 1. Hint: allow your function to produce values between 0 and 1 as before, and then note that if all the values in a vector`s`

are between 0 and 1, then all the values in`a+(b-a)*s`

are between \(a\) and \(b\).

Solution

I’m showing you my thought process in this one. The answer I want from you is the one at the end.

So, start by copying and pasting what you had before:

```
<- function(x) {
rescale <- min(x)
lo <- max(x)
hi stopifnot(hi != lo)
- lo) / (hi - lo)
(x }
```

On the top line, add the extra inputs and their default values. I also changed the name of my function, for reasons you’ll see later:

```
<- function(x, a=0, b=1) {
rescale2 <- min(x)
lo <- max(x)
hi stopifnot(hi != lo)
- lo) / (hi - lo)
(x }
```

Save the last line, since we have to do something else with it:

```
<- function(x, a=0, b=1) {
rescale2 <- min(x)
lo <- max(x)
hi stopifnot(hi != lo)
<- (x - lo) / (hi - lo)
s }
```

and finally add the calculation in the hint, which we don’t need to save because we are returning it:

```
<- function(x, a=0, b=1) {
rescale2 <- min(x)
lo <- max(x)
hi stopifnot(hi != lo)
<- (x - lo) / (hi - lo)
s + (b-a) * s
a }
```

This complete function is what I want to see from you. (You should keep the `stopifnot`

, because this function will have the exact same problem as the previous one if all the values in `x`

are the same.)

A better way is to observe that you can call functions inside functions. The function above is now a bit messy since it has several steps. Something that corresponds better to my hint is to call the original `rescale`

first, and then modify its result:

```
<- function(x, a=0, b=1) {
rescale3 <- rescale(x)
s + (b-a) * s
a }
```

The logic to this is rather clearly “rescale the input to be between 0 and 1, then rescale *that* to be between \(a\) and \(b\).” My `rescale2`

does exactly the same thing, but it’s much less clear that it does so, unless you happen to have in your head how `rescale`

works. (I think you are more likely to remember, sometime in the future, what `rescale`

*does*, compared to precisely how it *works*.)

That is why `rescale3`

is better than `rescale2`

. Remember that you can, and generally should, use functions that have already been written (by you or someone else) as part of functions that do more complex things. See also my second point below.

Extra: there are two important principles about why functions are important:

- they allow you to re-do a calculation on many different inputs (the point I’ve been making up to now)
- by abstracting a calculation into a thing with a name, it makes it easier to understand that calculation’s role in something bigger. The only thing we had to remember in
`rescale3`

is what the last line did, because the name of the function called on the first line tells us what happens there. This is much easier than remembering what the first four lines of`rescale2`

do.

The second principle here is what psychologists call “chunking”: you view a thing like my function `rescale`

as a single item, rather than as four separate lines of code, and then that single item can be part of something larger (like my `rescale3`

), and you have a smaller number of things to keep track of.

\(\blacksquare\)

- Test your new function two or more times, on input where you know or can guess what the output is going to be. In each case, explain briefly why your output makes sense.

Solution

I’ll start by using the default values for `a`

and `b`

(so I don’t have to specify them):

`rescale2(2:6)`

`[1] 0.00 0.25 0.50 0.75 1.00`

`rescale3(2:6)`

`[1] 0.00 0.25 0.50 0.75 1.00`

I did both of the variants of my function; of course, you’ll only have one variant.

We got the same answer as before for the same input, so the default values \(a=0, b=1\) look as if they have been used.

Let’s try a different one:

```
<- c(7, 11, 12)
v rescale2(v, 10, 30)
```

`[1] 10 26 30`

`rescale3(v, 10, 30)`

`[1] 10 26 30`

The lowest value in `v`

has become 10, and the highest has become 30. (Also, the in-between value 11 was closer to 12 than to 7, and it has become something closer to 30 than to 10.)

Extra: you can also name any of your inputs:

`rescale3(x=v, a=10, b=30)`

`[1] 10 26 30`

and if you name them, you can shuffle the order too:

`rescale3(a=10, b=30, x=v)`

`[1] 10 26 30`

The point of doing more than one test is to check that different aspects of your function all work. Therefore, the best testing here checks that the defaults work, and that the answer is sensible for some different `a`

and `b`

(to check that this works as well).

When you write your version of `rescale`

with the optional inputs, it’s best if you do it so that the things you have to supply (the vector of numbers) is *first*. If you put `a`

and `b`

first, when you want to omit them, you’ll have to call the input vector by name, like this:

`rescale3(x=v)`

`[1] 0.0 0.8 1.0`

because otherwise the input vector will be taken to be `a`

, not what you want.

\(\blacksquare\)

*Hotpo*is short for half or triple-plus-one.↩︎R style is to use the last line of the function for the return value, unless you are jumping out of the function before the end, in which case use

`return`

.↩︎*Hotpo*is short for half or triple-plus-one.↩︎Nobody knows whether you

*always*get to 1, but also nobody has ever found a case where you don’t. Collatz’s conjecture, that you will get to 1 eventually, is known to be true for all starting \(x_1\) up to some absurdly large number, but not for*all*starting points.↩︎I should have been more careful in my functions to make sure everything was integers, and, in particular, to do integer division by 2 because I knew that this division was going to come out even.↩︎

This is something I’ve done in the past and been bitten by, so I am trying to get myself not to do it any more.↩︎

In computing with decimal numbers, things are almost never exactly zero or exactly infinite; they are very small or very big. The mean here, which you would calculate to be zero, is less than the so-called machine epsilon, which is about 10 to the minus 16 in R (R works in double precision). A mean that small is, to the computer, indistinguishable from zero. It came out that way because the last value is added to a total that is by that point negative, and so you have a loss of accuracy because of subtracting nearly equal quantities. I learned all about this stuff in my first real computer science course, which I think was a numerical math course, some absurd number of years ago. It looks as if this gets taught in CSCC37 these days.↩︎