Chapter 22 Functions

library(tidyverse)

22.1 Making some R functions

Let’s write some simple R functions to convert temperatures, and later to play with text.

  1. A temperature in Celsius is converted to one in Kelvin by adding 273.15. (A temperature of \(-273.15\) Celsius, 0 Kelvin, is the “absolute zero” temperature that nothing can be colder than.) Write a function called c_to_k that converts an input Celsius temperature to one in Kelvin, and test that it works.

  2. Write a function to convert a Fahrenheit temperature to Celsius. The way you do that is to subtract 32 and then multiply by \(5/9\).

  3. Using the functions you already wrote, write a function to convert an input Fahrenheit temperature to Kelvin.

  4. Rewrite your Fahrenheit-to-Celsius convertor to take a suitable default value and check that it works as a default.

  5. What happens if you feed your Fahrenheit-to-Celsius convertor a vector of Fahrenheit temperatures? What if you use it in a mutate?

  6. Write another function called wrap that takes two arguments: a piece of text called text, which defaults to hello, and another piece of text called outside, which defaults to *. The function returns text with the text outside placed before and after, so that calling the function with the defaults should return *hello*. To do this, you can use str_c from stringr (loaded with the tidyverse) which places its text arguments side by side and glues them together into one piece of text. Test your function briefly.

  7. What happens if you want to change the default outside but use the default for text? How do you make sure that happens? Explore.

  8. What happens if you feed your function wrap a vector for either of its arguments? What about if you use it in a mutate?

22.2 The Collatz sequence

The Collatz sequence is a sequence of integers \(x_1, x_2, \ldots\) defined in a deceptively simple way: if \(x_n\) is the current term of the sequence, then \(x_{n+1}\) is defined as \(x_n/2\) if \(x_n\) is even, and \(3x_n+1\) if \(x_n\) is odd. We are interested in understanding how this sequence behaves; for example, what happens to it as \(n\) gets large, for different choices of the first term \(x_1\)? We will explore this numerically with R; the ambitious among you might like to look into the mathematics of it.

  1. What happens to the sequence when it reaches 4? What would be a sensible way of defining where it ends? Explain briefly.

  2. Write an R function called is_odd that returns TRUE if its input is an odd number and FALSE if it is even (you can assume that the input is an integer and not a decimal number). To do that, you can use the function %% where a %% b is the remainder when a is divided by b. To think about oddness or evenness, consider the remainder when you divide by 2.

  3. Write an R function called hotpo135 that takes an integer as input and returns the next number in the Collatz sequence. To do this, use the function you just wrote that determines whether a number is even or odd.

  4. Now write a function hotpo that will return the whole Collatz sequence for an input \(x_1\). For this, assume that you will eventually get to 1.

  5. Write two (very small) functions that take an entire sequence as input and return (i) the length of the sequence and (ii) the maximum value it attains.

  6. Make a data frame consisting of the values 11 through 20, and, using tidyverse ideas, obtain a data frame containing the Collatz sequences starting at each of those values, along with their lengths and their maximum values. Which sequence is longest? Which one goes up highest?

22.3 Coefficient of Variation

The coefficient of variation of a vector x is defined as the standard deviation of x divided by the mean of x.

  1. Write a function called cv that calculates the coefficient of variation of its input and returns the result. You should use base R’s functions that reliably compute the pieces that you need.

  2. Use your function to find the coefficient of variation of the set of integers 1 through 5.

  3. Define a vector as follows:

v <- c(-2.8, -1.8, -0.8, 1.2, 4.2)

What is its coefficient of variation, according to your function? Does this make sense? Why did this happen? Explain briefly.

  1. Most people only calculate a coefficient of variation if there are no negative numbers. Rewrite your function so that it gives an error if there are any negative numbers in the input, and test it with the vector v above. Hint: you might need to add error=TRUE to your chunk header to allow your document to preview/knit (inside the curly brackets at the top of the chunk, after a comma).

22.4 Rescaling

Suppose we have this vector of values:

z <- c(10, 14, 11)
z
## [1] 10 14 11

We want to scale these so that the smallest value is 0 and the largest is 1. We are going to be doing this a lot, so we are going to write a function that will work for any input.

  1. Using a copy of my z, work out min(z) and max(z). What do they do? Explain (very) briefly.

  2. What do these lines of code do, using the same z that I had? Run them and see, and describe briefly what s contains.

lo <- min(z)
hi <- max(z)
s <- (z - lo) / (hi - lo)
s
  1. Write a function called rescale that implements the calculation above, for any input vector called x. (Note that I changed the name.)

  2. Test your function on my z, and on another vector of your choosing. Explain briefly why the answer you get from your vector makes sense.

  3. What happens if your input to rescale is a vector of numbers all the same? Give an example. Rewrite your function to intercept this case and give a helpful error message.

  4. Make a dataframe (containing any numeric values), and in it create a new column containing the rescaled version of one of its columns, using your function. Show your result.

  5. We might want to rescale the input not to be between 0 and 1, but between two values a and b that we specify as input. If a and/or b are not given, we want to use the values 0 for a and 1 for b. Rewrite your function to rescale the input to be between a and b instead of 0 and 1. Hint: allow your function to produce values between 0 and 1 as before, and then note that if all the values in a vector s are between 0 and 1, then all the values in a+(b-a)*s are between \(a\) and \(b\).

  6. Test your new function two or more times, on input where you know or can guess what the output is going to be. In each case, explain briefly why your output makes sense.

My solutions follow:

22.5 Making some R functions

Let’s write some simple R functions to convert temperatures, and later to play with text.

  1. A temperature in Celsius is converted to one in Kelvin by adding 273.15. (A temperature of \(-273.15\) Celsius, 0 Kelvin, is the “absolute zero” temperature that nothing can be colder than.) Write a function called c_to_k that converts an input Celsius temperature to one in Kelvin, and test that it works.

Solution

This is mostly an exercise in structuring your function correctly. Let’s call the input C (uppercase C, since lowercase c has a special meaning to R):

c_to_k <- function(C) {
  C + 273.15
}
c_to_k(0)
## [1] 273.15
c_to_k(20)
## [1] 293.15

This is the simplest way to do it: the last line of the function, if calculated but not saved, is the value that gets returned to the outside world. The checks suggest that it worked.

If you’re used to Python or similar, you might prefer to calculate the value to be returned and then return it. You can do that in R too:

c_to_k <- function(C) {
  K <- C + 273.15
  return(K)
}
c_to_k(0)
## [1] 273.15
c_to_k(20)
## [1] 293.15

That works just as well, and for the rest of this question, you can go either way.36

\(\blacksquare\)

  1. Write a function to convert a Fahrenheit temperature to Celsius. The way you do that is to subtract 32 and then multiply by \(5/9\).

Solution

On the model of the previous one, we should call this f_to_c. I’m going to return the last line, but you can save the calculated value and return that instead:

f_to_c <- function(F) {
  (F - 32) * 5 / 9
}
f_to_c(32)
## [1] 0
f_to_c(50)
## [1] 10
f_to_c(68)
## [1] 20

Americans are very good at saying things like “temperatures in the 50s”, which don’t mean much to me, so I like to have benchmarks to work with: these are the Fahrenheit versions of 0, 10, and 20 Celsius.

Thus “in the 50s” means “between about 10 and 15 Celsius”.

\(\blacksquare\)

  1. Using the functions you already wrote, write a function to convert an input Fahrenheit temperature to Kelvin.

Solution

This implies that you can piggy-back on the functions you just wrote, which goes as below. First you convert the Fahrenheit to Celsius, and then you convert that to Kelvin. (This is less error-prone than trying to use algebra to get a formula for this conversion and then implementing that):

f_to_k <- function(F) {
  C <- f_to_c(F)
  K <- c_to_k(C)
  return(K)
}
f_to_k(32)
## [1] 273.15
f_to_k(68)
## [1] 293.15

These check because in Celsius they are 0 and 20 and we found the Kelvin equivalents of those to be these values earlier.

I wrote this one with a return because I thought it made the structure clearer: run one function, save the result, run another function, save the result, then return what you’ve got.

\(\blacksquare\)

  1. Rewrite your Fahrenheit-to-Celsius convertor to take a suitable default value and check that it works as a default.

Solution

You can choose any default you like. I’ll take a default of 68 (what I would call “a nice day”):

f_to_c <- function(F = 68) {
  (F - 32) * 5 / 9
}
f_to_c(68)
## [1] 20
f_to_c()
## [1] 20

The change is in the top line of the function. You see the result: if we run it without an input, we get the same answer as if the input had been 68.

\(\blacksquare\)

  1. What happens if you feed your Fahrenheit-to-Celsius convertor a vector of Fahrenheit temperatures? What if you use it in a mutate?

Solution

Try it and see:

temps <- seq(30, 80, 10)
temps
## [1] 30 40 50 60 70 80
f_to_c(temps)
## [1] -1.111111  4.444444 10.000000 15.555556 21.111111 26.666667

Each of the Fahrenheit temperatures gets converted into a Celsius one. This is perhaps more useful in a data frame, thus:

tibble(temps = seq(30, 80, 10)) %>%
  mutate(celsius = f_to_c(temps))
## # A tibble: 6 × 2
##   temps celsius
##   <dbl>   <dbl>
## 1    30   -1.11
## 2    40    4.44
## 3    50   10   
## 4    60   15.6 
## 5    70   21.1 
## 6    80   26.7

All the temperatures are side-by-side with their equivalents.

Here’s another way to do the above:

temps <- seq(30, 80, 10)
temps %>%
  enframe(value = "fahrenheit") %>%
  mutate(celsius = f_to_c(temps))
## # A tibble: 6 × 3
##    name fahrenheit celsius
##   <int>      <dbl>   <dbl>
## 1     1         30   -1.11
## 2     2         40    4.44
## 3     3         50   10   
## 4     4         60   15.6 
## 5     5         70   21.1 
## 6     6         80   26.7

enframe creates a two-column data frame out of a vector (like temps). A vector can have “names”, in which case they’ll be used as the name column; the values will go in a column called value unless you rename it, as I did.

\(\blacksquare\)

  1. Write another function called wrap that takes two arguments: a piece of text called text, which defaults to hello, and another piece of text called outside, which defaults to *. The function returns text with the text outside placed before and after, so that calling the function with the defaults should return *hello*. To do this, you can use str_c from stringr (loaded with the tidyverse) which places its text arguments side by side and glues them together into one piece of text. Test your function briefly.

Solution

This:

wrap <- function(text = "hello", outside = "*") {
  str_c(outside, text, outside)
}

I can run this with the defaults:

wrap()
## [1] "*hello*"

or with text of my choosing:

wrap("goodbye", "_")
## [1] "_goodbye_"

I think that’s what I meant by “test briefly”.

\(\blacksquare\)

  1. What happens if you want to change the default outside but use the default for text? How do you make sure that happens? Explore.

Solution

The obvious thing is this, which doesn’t work:

wrap("!")
## [1] "*!*"

This takes text to be !, and outside to be the default. How do we get outside to be ! instead? The key is to specify the input by name:

wrap(outside = "!")
## [1] "!hello!"

This correctly uses the default for text.

If you specify inputs without names, they are taken to be in the order that they appear in the function definition. As soon as they get out of order, which typically happens by using the default for something early in the list, as we did here for text, you have to specify names for anything that comes after that. These are the names you put on the function’s top line.

You can always use names:

wrap(text = "thing", outside = "**")
## [1] "**thing**"

and if you use names, they don’t even have to be in order:

wrap(outside = "!?", text = "fred")
## [1] "!?fred!?"

\(\blacksquare\)

  1. What happens if you feed your function wrap a vector for either of its arguments? What about if you use it in a mutate?

Solution

Let’s try:

mytext <- c("a", "b", "c")
wrap(text = mytext)
## [1] "*a*" "*b*" "*c*"
myout <- c("*", "!")
wrap(outside = myout)
## [1] "*hello*" "!hello!"

If one of the inputs is a vector, the other one gets “recycled” as many times as the vector is long. What if they’re both vectors?

mytext2 <- c("a", "b", "c", "d")
wrap(mytext2, myout)
## [1] "*a*" "!b!" "*c*" "!d!"

This uses the two inputs in parallel, repeating the short one as needed. But this, though it works, gives a warning:

wrap(mytext, myout)
## Warning in stri_c(..., sep = sep, collapse = collapse, ignore_null = TRUE):
## longer object length is not a multiple of shorter object length
## [1] "*a*" "!b!" "*c*"

This is because the shorter vector (of length 2 here) doesn’t go evenly into the longer one (length 3). It gives a warning because this is probably not what you wanted.

The mutate thing is easier, because all the columns in a data frame have to be the same length. LETTERS is a vector with the uppercase letters in it:

tibble(mytext = LETTERS[1:6], myout = c("*", "**", "!", "!!", "_", "__")) %>%
  mutate(newthing = wrap(mytext, myout))
## # A tibble: 6 × 3
##   mytext myout newthing
##   <chr>  <chr> <chr>   
## 1 A      *     *A*     
## 2 B      **    **B**   
## 3 C      !     !C!     
## 4 D      !!    !!D!!   
## 5 E      _     _E_     
## 6 F      __    __F__

\(\blacksquare\)

22.6 The Collatz sequence

The Collatz sequence is a sequence of integers \(x_1, x_2, \ldots\) defined in a deceptively simple way: if \(x_n\) is the current term of the sequence, then \(x_{n+1}\) is defined as \(x_n/2\) if \(x_n\) is even, and \(3x_n+1\) if \(x_n\) is odd. We are interested in understanding how this sequence behaves; for example, what happens to it as \(n\) gets large, for different choices of the first term \(x_1\)? We will explore this numerically with R; the ambitious among you might like to look into the mathematics of it.

  1. What happens to the sequence when it reaches 4? What would be a sensible way of defining where it ends? Explain briefly.

Solution

When the sequence reaches 4 (that is, when its current term is 4), the next term is 2 and the one after that is 1. Then the following term is 4 again (\((3 \times 1)+1\)) and then it repeats indefinitely, \(4, 2, 1, 4, 2, 1, \ldots\). I think a sensible way to define where the sequence ends is to say “when it reaches 1”, since if you start at 2 you’ll never reach 4 (so “when it reaches 4” won’t work), and it seems at least plausible that it will hit the cycle 4, 2, 1 sometime.

\(\blacksquare\)

  1. Write an R function called is_odd that returns TRUE if its input is an odd number and FALSE if it is even (you can assume that the input is an integer and not a decimal number). To do that, you can use the function %% where a %% b is the remainder when a is divided by b. To think about oddness or evenness, consider the remainder when you divide by 2.

Solution

Let’s try this out. For example, 5 is odd and 6 is even, so

5 %% 2
## [1] 1
6 %% 2
## [1] 0

When a number is odd, its remainder on dividing by 2 is 1, and when even, the remainder is 0. There is an additional shortcut here in that 1 is the numeric value of TRUE and 0 of FALSE, so all we have to do is calculate the remainder on dividing by 2, turn it into a logical, and return it:

is_odd <- function(x) {
  r <- x %% 2
  as.logical(r)
}

You probably haven’t seen as.logical before, but it’s the same idea as as.numeric: turn something that looks like a TRUE or FALSE into something that actually is.

We should test it:

is_odd(19)
## [1] TRUE
is_odd(12)
## [1] FALSE
is_odd(0)
## [1] FALSE

0 is usually considered an even number, so this is good.

\(\blacksquare\)

  1. Write an R function called hotpo137 that takes an integer as input and returns the next number in the Collatz sequence. To do this, use the function you just wrote that determines whether a number is even or odd.

Solution

The logic is “if the input is odd, return 3 times it plus 1, otherwise return half of it”. The R structure is an if-then-else:

hotpo1 <- function(x) {
  if (is_odd(x)) 3 * x + 1 else x / 2
}

In R, the condition that is tested goes in brackets, and then if the value-if-true and the value-if-false are single statements, you just type them. (If they are more complicated than that, you put them in curly brackets.) Now you see the value of writing is_odd earlier; this code almost looks like the English-language description of the sequence. If we had not written is_odd before, the condition would have looked something like

if (x %% 2 == 1) 3 * x + 1 else x / 2

which would have been a lot harder to read.

All right, let’s try that out:

hotpo1(4)
## [1] 2
hotpo1(7)
## [1] 22
hotpo1(24)
## [1] 12

That looks all right so far.

\(\blacksquare\)

  1. Now write a function hotpo that will return the whole Collatz sequence for an input \(x_1\). For this, assume that you will eventually get to 1.

Solution

This is a loop, but not a for loop (or something that we could do rowwise), because we don’t know how many times we have to go around. This is the kind of thing that we should use a while loop for: “keep going while a condition is true”. In this case, we should keep going if we haven’t reached 1 yet. If we haven’t reached 1, we should generate the next value of the sequence and glue it onto what we have so far. To initialize the sequence, we start with the input value. There is an R trick to glue a value onto the sequence, which is to use c with a vector and a value, and save it back into the vector:

hotpo <- function(x) {
  sequence <- x
  term <- x
  while (term > 1) {
    term <- hotpo1(term)
    sequence <- c(sequence, term)
  }
  sequence
}

I use term to hold the current term of the sequence, and overwrite it with the next one (since I don’t need the old one any more).

Does it work?

hotpo(4)
## [1] 4 2 1
hotpo(12)
##  [1] 12  6  3 10  5 16  8  4  2  1
hotpo(97)
##   [1]   97  292  146   73  220  110   55  166   83  250  125  376  188   94   47
##  [16]  142   71  214  107  322  161  484  242  121  364  182   91  274  137  412
##  [31]  206  103  310  155  466  233  700  350  175  526  263  790  395 1186  593
##  [46] 1780  890  445 1336  668  334  167  502  251  754  377 1132  566  283  850
##  [61]  425 1276  638  319  958  479 1438  719 2158 1079 3238 1619 4858 2429 7288
##  [76] 3644 1822  911 2734 1367 4102 2051 6154 3077 9232 4616 2308 1154  577 1732
##  [91]  866  433 1300  650  325  976  488  244  122   61  184   92   46   23   70
## [106]   35  106   53  160   80   40   20   10    5   16    8    4    2    1

97 is a wild ride, but it does eventually get to 1.

Extra: where I originally saw this, which was “Metamagical Themas” by Douglas Hofstadter, he was illustrating the programming language Lisp and the process of recursion, whereby you define a function in terms of itself. This one is a natural for that, because the Collatz sequence starting at \(x\) is \(x\) along with the Collatz sequence starting at the next term. For example, if you start at 12, the next term is 6, so that the Collatz sequence starting at 12 is 12 followed by the Collatz sequence starting at 6. There is no dependence any further back. You can do recursion in R also; there is no problem with a function calling itself:

hotpo_rec <- function(x) {
  if (x == 1) 1 else c(x, hotpo_rec(hotpo1(x)))
}

Recursive functions have two parts: a “base case” that says how you know you are done (the 1 here), and a “recursion” that says how you move to a simpler case, here working out the next term, getting the whole sequence for that, and gluing the input onto the front. It seems paradoxical that you define a function in terms of itself, but what you are doing is calling a simpler sequence, in this case one that is length one shorter than the sequence for the original input. Thus, we hope,38 we will eventually reach 1.

Does it work?

hotpo_rec(12)
##  [1] 12  6  3 10  5 16  8  4  2  1
hotpo_rec(97)
##   [1]   97  292  146   73  220  110   55  166   83  250  125  376  188   94   47
##  [16]  142   71  214  107  322  161  484  242  121  364  182   91  274  137  412
##  [31]  206  103  310  155  466  233  700  350  175  526  263  790  395 1186  593
##  [46] 1780  890  445 1336  668  334  167  502  251  754  377 1132  566  283  850
##  [61]  425 1276  638  319  958  479 1438  719 2158 1079 3238 1619 4858 2429 7288
##  [76] 3644 1822  911 2734 1367 4102 2051 6154 3077 9232 4616 2308 1154  577 1732
##  [91]  866  433 1300  650  325  976  488  244  122   61  184   92   46   23   70
## [106]   35  106   53  160   80   40   20   10    5   16    8    4    2    1

It does.

Recursive functions are often simple to understand, but they are not always very efficient. They can take a lot of memory, because they have to handle the intermediate calls to the function, which they have to save to use later (in the case of hotpo_rec(97) there are a lot of those). Recursive functions are often paired with a technique called “memoization”, where each time you calculate the function’s value, you save it in another array. The first thing you do in the recursive function is to check whether you already have the answer, in which case you just look it up and return it. It was a lot of work here to calculate the sequence from 97, but if we had saved the results, we would already have the answers for 292, 146, 73, 220 and so on, and getting those later would be a table lookup rather than another recursive calculation.

\(\blacksquare\)

  1. Write two (very small) functions that take an entire sequence as input and return (i) the length of the sequence and (ii) the maximum value it attains.

Solution

These are both one-liners. Call the input whatever you like:

hotpo_len <- function(sequence) length(sequence)
hotpo_max <- function(sequence) max(sequence)

Because they are one-liners, you don’t even need the curly brackets, although there’s no problem if they are there.

Testing:

hotpo_len(hotpo(12))
## [1] 10
hotpo_max(hotpo(97))
## [1] 9232

This checks with what we had before.

\(\blacksquare\)

  1. Make a data frame consisting of the values 11 through 20, and, using tidyverse ideas, obtain a data frame containing the Collatz sequences starting at each of those values, along with their lengths and their maximum values. Which sequence is longest? Which one goes up highest?

Solution

This one uses rowwise ideas:39

tibble(x = 11:20) %>%
  rowwise %>% 
  mutate(sequence = list(hotpo(x))) %>%
  mutate(length = hotpo_len(sequence)) %>%
  mutate(high = hotpo_max(sequence))
## # A tibble: 10 × 4
## # Rowwise: 
##        x sequence   length  high
##    <int> <list>      <int> <dbl>
##  1    11 <dbl [15]>     15    52
##  2    12 <dbl [10]>     10    16
##  3    13 <dbl [10]>     10    40
##  4    14 <dbl [18]>     18    52
##  5    15 <dbl [18]>     18   160
##  6    16 <dbl [5]>       5    16
##  7    17 <dbl [13]>     13    52
##  8    18 <dbl [21]>     21    52
##  9    19 <dbl [21]>     21    88
## 10    20 <dbl [8]>       8    20

First, we obtain a list-column containing the sequences (which is why its calculation needs a list around it), then two ordinary columns of their lengths and their maximum values.

The sequences for 18 and 19 are the longest, but the sequence for 15 goes up the highest.

\(\blacksquare\)

22.7 Coefficient of Variation

The coefficient of variation of a vector x is defined as the standard deviation of x divided by the mean of x.

  1. Write a function called cv that calculates the coefficient of variation of its input and returns the result. You should use base R’s functions that reliably compute the pieces that you need.

Solution

I like to make a function skeleton first to be sure I remember all the pieces. You can do this by making a code chunk, typing fun, waiting a moment, and selecting the “snippet” that is offered to you. That gives you this, with your cursor on name:

name <- function(variables) {

}

Give your function a name (I asked you to use cv), hit tab, enter a name like x for the input, hit tab again, and then fill in the body of the function, to end up with something like this:

cv <- function(x) {
mean_x <- mean(x)
sd_x <- sd(x)
sd_x/mean_x
}

I think this is best, for reasons discussed below.

Some observations:

  • in the last line of the function, you calculate something, and that something is the thing that gets returned. You do not need to save it in a variable, because then you have to return that variable by having its name alone on the last line.
  • R has a return() function, but it is bad style to use it to return a value at the end of a function. The time to use it is when you test something early in the function and want to return early with a value like zero or missing without doing any actual calculation. Looking ahead a bit, you might want to return “missing” if the mean is zero before dividing by it, which you could do like this, this being good style:
cv2 <- function(x) {
mean_x <- mean(x)
if (mean_x == 0) return(NA)
sd_x <- sd(x)
sd_x/mean_x
}
  • use the built-in mean and sd rather than trying to roll your own, because they have been tested by many users over a long time and they work. That’s what I meant by “pieces” in the question.

  • you might be tempted to do something like this:

cv3 <- function(x) {
sd(x)/mean(x)
}

This will work, but it is not a good habit to get into, and thus not best as an answer to this question. This one line of code says three things: “work out the SD of x”, “work out the mean of x”, and “divide them by each other”. It is much clearer, for anyone else reading the code (including yourself in six months when you have forgotten what you were thinking) to have the three things one per line, so that anyone reading the code sees that each line does one thing and what that one thing is. There is a reason why production code and code golf are two very different things. Code that is easy to read is also easy to maintain, either by you or others.

  • using something other than mean and sd as the names of your intermediate results is a good idea because these names are already used by R (as the names of the functions that compute the mean and SD). Redefining names that R uses can make other code behave unpredictably and cause hard-to-find bugs.40

\(\blacksquare\)

  1. Use your function to find the coefficient of variation of the set of integers 1 through 5.

Solution

This can be as simple as

cv(1:5)
## [1] 0.5270463

or, equally good, define this into a vector first:

y <- 1:5
cv(y)
## [1] 0.5270463

or, displaying a little less understanding, type the five numbers into the vector first (or directly as input to the function):

y <- c(1, 2, 3, 4, 5)
cv(y)
## [1] 0.5270463
cv(c(1, 2, 3, 4, 5))
## [1] 0.5270463

\(\blacksquare\)

  1. Define a vector as follows:
v <- c(-2.8, -1.8, -0.8, 1.2, 4.2)

What is its coefficient of variation, according to your function? Does this make sense? Why did this happen? Explain briefly.

Solution

Try it and see:

cv(v)
## [1] 6.248491e+16

A very large number, much bigger than any of the data values; with these human-sized numbers, we’d expect a human-sized coefficient of variation as well.

What actually happened was that the mean of v is this:

mean(v)
## [1] 4.440892e-17

Zero, or close enough, so that in calculating the coefficient of variation, we divided by something that was (almost) zero and got a result that was (almost) infinite.41

The possibility of getting a zero mean is why most people only calculate a coefficient of variation if all of the numbers are positive, which brings us to the next part:

\(\blacksquare\)

  1. Most people only calculate a coefficient of variation if there are no negative numbers. Rewrite your function so that it gives an error if there are any negative numbers in the input, and test it with the vector v above. Hint: you might need to add error=TRUE to your chunk header to allow your document to preview/knit (inside the curly brackets at the top of the chunk, after a comma).

Solution

This is a case for stopifnot, or of if coupled with stop. Check this up front, as the first thing you do before you calculate anything else. As to what to check, there are several possibilities:

  • stop if any of the numbers are negative
  • continue if all of the numbers are positive
  • stop if the smallest number is negative (the smallest number is negative if and only if not all the numbers are positive)

R has functions any and all that do what you’d expect:

w <- 1:5
w
## [1] 1 2 3 4 5
any(w>3.5)
## [1] TRUE
all(w<4.5)
## [1] FALSE

Are there any numbers greater than 3.5 (yes, 4 and 5); are all the numbers less than 4.5 (no, 5 isn’t).

Cite your sources for these if you use either of them, since this is the first place in the course that I’m mentioning either of them.

Remember that if you use stopifnot, the condition that goes in there is what has to be true if the function is to run; if you use if and stop, the condition is what will stop the function running. With that in mind, I would code my three possibilities above this way. First off, here’s the original:

cv <- function(x) {
mean_x <- mean(x)
sd_x <- sd(x)
sd_x/mean_x
}

then, stop if any of the numbers are negative:

cv <- function(x) {
if (any(x<0)) stop("A value is negative")
mean_x <- mean(x)
sd_x <- sd(x)
sd_x/mean_x
}
cv(v)
## Error in cv(v): A value is negative

continue if all the numbers are positive

cv <- function(x) {
stopifnot(all(x>0))
mean_x <- mean(x)
sd_x <- sd(x)
sd_x/mean_x
}
cv(v)
## Error in cv(v): all(x > 0) is not TRUE

stop if the smallest value is negative

cv <- function(x) {
if (min(x)<0) stop("Smallest value is negative")
mean_x <- mean(x)
sd_x <- sd(x)
sd_x/mean_x
}
cv(v)
## Error in cv(v): Smallest value is negative

There are (at least) three other possibilities: you can negate the logical condition and interchange if/stop and stopifnot, thus (at the expense of some clarity of reading):

continue if it is not true that any of the numbers are negative

cv <- function(x) {
stopifnot(!any(x<0))
mean_x <- mean(x)
sd_x <- sd(x)
sd_x/mean_x
}
cv(v)
## Error in cv(v): !any(x < 0) is not TRUE

(you might be thinking of De Morgan’s laws here)

stop if it is not true that all the numbers are positive

cv <- function(x) {
if (!all(x>0)) stop("Not all values are positive")
mean_x <- mean(x)
sd_x <- sd(x)
sd_x/mean_x
}
cv(v)
## Error in cv(v): Not all values are positive

continue if the smallest value is not negative

cv <- function(x) {
stopifnot(min(x)>=0)
mean_x <- mean(x)
sd_x <- sd(x)
sd_x/mean_x
}
cv(v)
## Error in cv(v): min(x) >= 0 is not TRUE

or another way to do the last one, a more direct negation of the condition, which at my guess needs some extra brackets:

cv <- function(x) {
stopifnot(!(min(x)<0))
mean_x <- mean(x)
sd_x <- sd(x)
sd_x/mean_x
}
cv(v)
## Error in cv(v): !(min(x) < 0) is not TRUE

This one is hard to parse: what does that last message mean? I would take a negative off each side and read it as “min of x is negative is TRUE”, but that takes extra effort.

I said that last one needed some extra brackets. This is, I thought, to get the order of operations right (operator precedence); it turns out not to matter because “not” has lower precedence than most other things, so that these do actually work (the “not” is evaluated after the less-than and the other things, so last of all here, even though it appears to be “glued” to the min):

!min(v)<0
## [1] FALSE
!min(1:5)<0
## [1] TRUE

See this for details. See especially the second set of examples, the ones beginning with “Special operators”, and see especially-especially the comment at the bottom of these examples! That is to say, you should put in the extra brackets unless you also make the case that they are not needed, because anyone reading your code is guaranteed to be confused by it when they read it (including you in six months, because you will not remember the operator priority of “not”).

My take is that one of the first three of the seven possibilities for coding stopifnot or if with stop is the best, since these more obviously encode the condition for continuing or stopping as appropriate. There are two things here: one is that you have to get the code right, but the second is that you have to get the code clear, so that it is obvious to anyone reading it that it does the right thing (once again, this includes you in six months). On that score, the first three alternatives are a direct expression of what you want to achieve, and the last four make it look as if you found a way of coding it that worked and stopped there, without thinking about whether there were any other, clearer or more expressive, possibilities.

\(\blacksquare\)

22.8 Rescaling

Suppose we have this vector of values:

z <- c(10, 14, 11)
z
## [1] 10 14 11

We want to scale these so that the smallest value is 0 and the largest is 1. We are going to be doing this a lot, so we are going to write a function that will work for any input.

  1. Using a copy of my z, work out min(z) and max(z). What do they do? Explain (very) briefly.

Solution

Simply this – define z first:

z <- c(10, 14, 11)
min(z)
## [1] 10
max(z)
## [1] 14

They are respectively (and as you would guess) the smallest and largest of the values in z. (A nice gentle warmup, but I wanted to get you on the right track for what is coming up.)

\(\blacksquare\)

  1. What do these lines of code do, using the same z that I had? Run them and see, and describe briefly what s contains.
lo <- min(z)
hi <- max(z)
s <- (z - lo) / (hi - lo)
s

Solution

Here we go:

lo <- min(z)
hi <- max(z)
s <- (z - lo) / (hi - lo)
s
## [1] 0.00 1.00 0.25

The lowest value became 0, the highest 1, and the other one to something in between. Saying this shows the greatest insight.

Extra: the reason for doing it in three steps rather than one (see below) is (i) it makes it clearer what is going on (and thus makes it less likely that you make a mistake), and (ii) it is more efficient, since my way only finds the minimum once instead of twice. Compare this approach with mine above:

(z - min(z)) / (max(z) - min(z))
## [1] 0.00 1.00 0.25

More complicated with all the brackets, and two min(z). Admittedly the difference here will be thousandths of a second, but why call a function twice when you don’t have to?

\(\blacksquare\)

  1. Write a function called rescale that implements the calculation above, for any input vector called x. (Note that I changed the name.)

Solution

Write a function skeleton:

rescale <- function(x) {

}

and inside the curly brackets put the code from above, replacing z with x everywhere:

rescale <- function(x) {
lo <- min(x)
hi <- max(x)
(x - lo) / (hi - lo)
}

You’ll need to make sure your function returns something to the outside world. Either don’t save the last line in s (as I did here), or save it in s and then return s:

rescale <- function(x) {
lo <- min(x)
hi <- max(x)
s <- (x - lo) / (hi - lo)
s
}

or use return() if you must, but be aware that this is bad style in R (unlike Python, where you need it). The approved way of using return in R is when you are returning something earlier than the last line of a function, for example, you are testing a simple case first and returning the value that goes with that, before getting down to the serious computation.

Extra: in the spirit of what’s coming up below, you might check first whether the maximum and minimum are the same and return something else if that’s the case:

rescale0 <- function(x) {
lo <- min(x)
hi <- max(x)
if (lo == hi) return(0)
s <- (x - lo) / (hi - lo)
s
}

This is good style; in this case, if lo and hi are the same, we want to return something else (zero) to the outside world, and then we do the calculation, knowing that lo and hi are different, so that we are sure we are not dividing by zero (but I get ahead of myself).

Doing it this way, there is something to be careful of: a function ought to return a predictable type of thing: numbers, in this case. If you have your function return text on error, like this:

rescale0 <- function(x) {
lo <- min(x)
hi <- max(x)
if (lo == hi) return("High and low need to be different")
s <- (x - lo) / (hi - lo)
s
}

then you can get into trouble:

rescale0(z)
## [1] 0.00 1.00 0.25
rescale0(c(3,3,3))
## [1] "High and low need to be different"

The first one is numbers and the second one is text.

\(\blacksquare\)

  1. Test your function on my z, and on another vector of your choosing. Explain briefly why the answer you get from your vector makes sense.

Solution

On my z:

rescale(z)
## [1] 0.00 1.00 0.25

The same values as I got before, so that works.

For your vector, use whatever you like. I think it makes sense to have the values already in order, to make it easier to check. Here’s one possibility:

w <- 2:6
w
## [1] 2 3 4 5 6

and then

rescale(w)
## [1] 0.00 0.25 0.50 0.75 1.00

The smallest value is 2, which goes to zero; the largest is 6, which goes to 1, and the others are equally spaced between in both the input and the output.

Another possibility is to use a vector with values whose largest and smallest you can clearly see:

w <- c(10, 11, 100, 0, 20)
rescale(w)
## [1] 0.10 0.11 1.00 0.00 0.20

Clearly the smallest value is 0 and the largest is 100. These become 0 and 1, and these particular values make it easy to see what happened: each of the other values got divided by 100.

Some discussion is needed here, in that you need to say something convincing about why your answer is right.

Extra: This is why I had you use a name other than z for the input to your function. The function can be used on any input, not just the z that we tested it on. There’s another R-specific reason, which is that you need to be careful about using the named inputs only. Consider this function:

ff <- function(x) {
x + z
}

ff(10)
## [1] 20 24 21

Where did z come from? R used the z we had before, which is rather dangerous: what if we had a z lying around from some completely different work? Much better to have a function work with only inputs in the top line:

ff <- function(x, z) {
x + z
}
ff(10, 3)
## [1] 13
ff(10)
## Error in ff(10): argument "z" is missing, with no default

The first time, the two inputs are added together, but the second time it tells you it was expecting a value to use for z and didn’t see one. Much safer.

\(\blacksquare\)

  1. What happens if your input to rescale is a vector of numbers all the same? Give an example. Rewrite your function to intercept this case and give a helpful error message.

Solution

First, try it and see. Any collection of values all the same will do:

rescale(c(3,3,3))
## [1] NaN NaN NaN

NaN stands for “not a number”. The way we got it is that the minimum and maximum were the same, so our function ended up dividing by zero (in fact, working out zero divided by zero). This is, in R terms, not even an error, but the answer is certainly not helpful.

The easiest way to check inputs is to use stopifnot to express what should be true if the function is to proceed. Here, we want the maximum and minimum to be different, so:

rescale <- function(x) {
lo <- min(x)
hi <- max(x)
stopifnot(hi != lo)
(x - lo) / (hi - lo)
}
rescale(c(3,3,3))
## Error in rescale(c(3, 3, 3)): hi != lo is not TRUE

This is much clearer: I only have to recall what my hi and lo are to see what the problem is.

Extra 1: by calculating and saving the min and max up front, I still only need to calculate them once. If you do it this way:

rescale <- function(x) {
stopifnot(max(x) != min(x))
(x - min(x)) / (max(x) - min(x))
}
rescale(c(3,3,3))
## Error in rescale(c(3, 3, 3)): max(x) != min(x) is not TRUE

you get a slightly more informative error message, but you have calculated the max twice and the min three times for no reason.

Extra 2: stopifnot is shorthand for this:

rescale <- function(x) {
lo <- min(x)
hi <- max(x)
if (hi == lo) stop("min and max are the same!")
(x - lo) / (hi - lo)
}
rescale(c(3,3,3))
## Error in rescale(c(3, 3, 3)): min and max are the same!

I didn’t show you this, so if you use stop, you must tell me where you found out about it. This is better than returning some text (see rescale0 above) or printing a message: it’s an error, so you want to make it look like an error. I am very sympathetic to being persuaded that this is better than stopifnot, because you can customize the message (and, also, you don’t have to go through the double-negative contortions of stopifnot). Another way to use stopifnot and get a customized message is this one (that I only learned about right when you were writing this Assignment):

rescale <- function(x) {
lo <- min(x)
hi <- max(x)
stopifnot("high and low must be different" = (hi != lo))
(x - lo) / (hi - lo)
}
rescale(c(3,3,3))
## Error in rescale(c(3, 3, 3)): high and low must be different

This is called a “named argument”, and the name, if given, is used as an error message.

Extra 3: returning to my rescale0 from above:

rescale0
## function(x) {
## lo <- min(x)
## hi <- max(x)
## if (lo == hi) return("High and low need to be different")
## s <- (x - lo) / (hi - lo)
## s
## }
## <bytecode: 0x560e93136430>

this can get you into trouble if you use it in a dataframe. This is a bit complicated, since it has to use list-columns. Here we go:

tibble(x = list(z, c(3,3,3)))
## # A tibble: 2 × 1
##   x        
##   <list>   
## 1 <dbl [3]>
## 2 <dbl [3]>

Just to check that this does contain what you think it does:

tibble(x = list(z, c(3,3,3))) %>% unnest(x)
## # A tibble: 6 × 1
##       x
##   <dbl>
## 1    10
## 2    14
## 3    11
## 4     3
## 5     3
## 6     3

So now, for each of those two input vectors, what happens when we run rescale0 on them? This is rowwise:

tibble(x = list(z, c(3,3,3))) %>% 
  rowwise() %>% 
  mutate(ans = list(rescale0(x)))
## # A tibble: 2 × 2
## # Rowwise: 
##   x         ans      
##   <list>    <list>   
## 1 <dbl [3]> <dbl [3]>
## 2 <dbl [3]> <chr [1]>

The first ans is a vector of 3 numbers, and the second one is one piece of text (the “error message”). I was actually surprised it got this far. So what happens when we unnest the second column?

tibble(x = list(z, c(3,3,3))) %>% 
  rowwise() %>% 
  mutate(ans = list(rescale0(x))) %>% 
  unnest(ans)
## Error in `list_unchop()`:
## ! Can't combine `x[[1]]` <double> and `x[[2]]` <character>.

Now we get a confusing error: it’s here that combining some numbers and some text in one column of a dataframe doesn’t work. To forestall this, we need to go back and rewrite rescale0 to not mix things up. Having it return an error, as the latest version of rescale does, gives an error here too, but at least we know what it means:

tibble(x = list(z, c(3,3,3))) %>% 
  rowwise() %>% 
  mutate(ans = list(rescale(x))) %>% 
  unnest(ans)
## Error in `mutate()`:
## ! Problem while computing `ans = list(rescale(x))`.
## ℹ The error occurred in row 2.
## Caused by error in `rescale()`:
## ! high and low must be different

because this is the error we anticipated: it says “somewhere within the list-column x, specifically in its second row, is a vector where everything is the same”.

\(\blacksquare\)

  1. Make a dataframe (containing any numeric values), and in it create a new column containing the rescaled version of one of its columns, using your function. Show your result.

Solution

This is less difficult than you might be expecting: make a dataframe with at least one numeric column, and use mutate:

d <- tibble(y=2:6)
d
## # A tibble: 5 × 1
##       y
##   <int>
## 1     2
## 2     3
## 3     4
## 4     5
## 5     6

and then

d %>% mutate(s=rescale(y))
## # A tibble: 5 × 2
##       y     s
##   <int> <dbl>
## 1     2  0   
## 2     3  0.25
## 3     4  0.5 
## 4     5  0.75
## 5     6  1

You can supply the values for what I called y, or use random numbers. It’s easier for you to check that it has worked if your column playing the role of my y has not too many values in it.

Extra: this is actually already in the tidyverse under the name percent_rank (“percentile ranks”):

d %>% mutate(s = percent_rank(y))
## # A tibble: 5 × 2
##       y     s
##   <int> <dbl>
## 1     2  0   
## 2     3  0.25
## 3     4  0.5 
## 4     5  0.75
## 5     6  1

The value 5, for example, is at the 75th percentile.

\(\blacksquare\)

  1. We might want to rescale the input not to be between 0 and 1, but between two values a and b that we specify as input. If a and/or b are not given, we want to use the values 0 for a and 1 for b. Rewrite your function to rescale the input to be between a and b instead of 0 and 1. Hint: allow your function to produce values between 0 and 1 as before, and then note that if all the values in a vector s are between 0 and 1, then all the values in a+(b-a)*s are between \(a\) and \(b\).

Solution

I’m showing you my thought process in this one. The answer I want from you is the one at the end.

So, start by copying and pasting what you had before:

rescale <- function(x) {
lo <- min(x)
hi <- max(x)
stopifnot(hi != lo)
(x - lo) / (hi - lo)
}

On the top line, add the extra inputs and their default values. I also changed the name of my function, for reasons you’ll see later:

rescale2 <- function(x, a=0, b=1) {
lo <- min(x)
hi <- max(x)
stopifnot(hi != lo)
(x - lo) / (hi - lo)
}

Save the last line, since we have to do something else with it:

rescale2 <- function(x, a=0, b=1) {
lo <- min(x)
hi <- max(x)
stopifnot(hi != lo)
s <- (x - lo) / (hi - lo)
}

and finally add the calculation in the hint, which we don’t need to save because we are returning it:

rescale2 <- function(x, a=0, b=1) {
lo <- min(x)
hi <- max(x)
stopifnot(hi != lo)
s <- (x - lo) / (hi - lo)
a + (b-a) * s
}

This complete function is what I want to see from you. (You should keep the stopifnot, because this function will have the exact same problem as the previous one if all the values in x are the same.)

A better way is to observe that you can call functions inside functions. The function above is now a bit messy since it has several steps. Something that corresponds better to my hint is to call the original rescale first, and then modify its result:

rescale3 <- function(x, a=0, b=1) {
s <- rescale(x)
a + (b-a) * s
}

The logic to this is rather clearly “rescale the input to be between 0 and 1, then rescale that to be between \(a\) and \(b\).” My rescale2 does exactly the same thing, but it’s much less clear that it does so, unless you happen to have in your head how rescale works. (I think you are more likely to remember, sometime in the future, what rescale does, compared to precisely how it works.)

That is why rescale3 is better than rescale2. Remember that you can, and generally should, use functions that have already been written (by you or someone else) as part of functions that do more complex things. See also my second point below.

Extra: there are two important principles about why functions are important:

  1. they allow you to re-do a calculation on many different inputs (the point I’ve been making up to now)
  2. by abstracting a calculation into a thing with a name, it makes it easier to understand that calculation’s role in something bigger. The only thing we had to remember in rescale3 is what the last line did, because the name of the function called on the first line tells us what happens there. This is much easier than remembering what the first four lines of rescale2 do.

The second principle here is what psychologists call “chunking”: you view a thing like my function rescale as a single item, rather than as four separate lines of code, and then that single item can be part of something larger (like my rescale3), and you have a smaller number of things to keep track of.

\(\blacksquare\)

  1. Test your new function two or more times, on input where you know or can guess what the output is going to be. In each case, explain briefly why your output makes sense.

Solution

I’ll start by using the default values for a and b (so I don’t have to specify them):

rescale2(2:6)
## [1] 0.00 0.25 0.50 0.75 1.00
rescale3(2:6)
## [1] 0.00 0.25 0.50 0.75 1.00

I did both of the variants of my function; of course, you’ll only have one variant.

We got the same answer as before for the same input, so the default values \(a=0, b=1\) look as if they have been used.

Let’s try a different one:

v <- c(7, 11, 12)
rescale2(v, 10, 30)
## [1] 10 26 30
rescale3(v, 10, 30)
## [1] 10 26 30

The lowest value in v has become 10, and the highest has become 30. (Also, the in-between value 11 was closer to 12 than to 7, and it has become something closer to 30 than to 10.)

Extra: you can also name any of your inputs:

rescale3(x=v, a=10, b=30)
## [1] 10 26 30

and if you name them, you can shuffle the order too:

rescale3(a=10, b=30, x=v)
## [1] 10 26 30

The point of doing more than one test is to check that different aspects of your function all work. Therefore, the best testing here checks that the defaults work, and that the answer is sensible for some different a and b (to check that this works as well).

When you write your version of rescale with the optional inputs, it’s best if you do it so that the things you have to supply (the vector of numbers) is first. If you put a and b first, when you want to omit them, you’ll have to call the input vector by name, like this:

rescale3(x=v)
## [1] 0.0 0.8 1.0

because otherwise the input vector will be taken to be a, not what you want.

\(\blacksquare\)


  1. Hotpo is short for half or triple-plus-one.↩︎

  2. R style is to use the last line of the function for the return value, unless you are jumping out of the function before the end, in which case use return.↩︎

  3. Hotpo is short for half or triple-plus-one.↩︎

  4. Nobody knows whether you always get to 1, but also nobody has ever found a case where you don’t. Collatz’s conjecture, that you will get to 1 eventually, is known to be true for all starting \(x_1\) up to some absurdly large number, but not for all starting points.↩︎

  5. I should have been more careful in my functions to make sure everything was integers, and, in particular, to do integer division by 2 because I knew that this division was going to come out even.↩︎

  6. This is something I’ve done in the past and been bitten by, so I am trying to get myself not to do it any more.↩︎

  7. In computing with decimal numbers, things are almost never exactly zero or exactly infinite; they are very small or very big. The mean here, which you would calculate to be zero, is less than the so-called machine epsilon, which is about 10 to the minus 16 in R (R works in double precision). A mean that small is, to the computer, indistinguishable from zero. It came out that way because the last value is added to a total that is by that point negative, and so you have a loss of accuracy because of subtracting nearly equal quantities. I learned all about this stuff in my first real computer science course, which I think was a numerical math course, some absurd number of years ago. It looks as if this gets taught in CSCC37 these days.↩︎