Chapter 22 Functions
library(tidyverse)
22.1 Making some R functions
Let’s write some simple R functions to convert temperatures, and later to play with text.
- A temperature in Celsius is converted to one in Kelvin by adding 273.15. (A temperature of \(-273.15\) Celsius, 0 Kelvin, is the “absolute zero” temperature that nothing can be colder than.) Write a function called c_to_k that converts an input Celsius temperature to one in Kelvin, and test that it works.
- Write a function to convert a Fahrenheit temperature to Celsius. The way you do that is to subtract 32 and then multiply by \(5/9\).
- Using the functions you already wrote, write a function to convert an input Fahrenheit temperature to Kelvin.
- Rewrite your Fahrenheit-to-Celsius convertor to take a suitable default value and check that it works as a default.
- What happens if you feed your Fahrenheit-to-Celsius convertor a vector of Fahrenheit temperatures? What if you use it in a mutate?
- Write another function called wrap that takes two arguments: a piece of text called text, which defaults to hello, and another piece of text called outside, which defaults to *. The function returns text with the text outside placed before and after, so that calling the function with the defaults should return *hello*. To do this, you can use str_c from stringr (loaded with the tidyverse) which places its text arguments side by side and glues them together into one piece of text. Test your function briefly.
- What happens if you want to change the default outside but use the default for text? How do you make sure that happens? Explore.
- What happens if you feed your function wrap a vector for either of its arguments? What about if you use it in a mutate?
22.2 The Collatz sequence
The Collatz sequence is a sequence of integers \(x_1, x_2, \ldots\) defined in a deceptively simple way: if \(x_n\) is the current term of the sequence, then \(x_{n+1}\) is defined as \(x_n/2\) if \(x_n\) is even, and \(3x_n+1\) if \(x_n\) is odd. We are interested in understanding how this sequence behaves; for example, what happens to it as \(n\) gets large, for different choices of the first term \(x_1\)? We will explore this numerically with R; the ambitious among you might like to look into the mathematics of it.
- What happens to the sequence when it reaches 4? What would be a sensible way of defining where it ends? Explain briefly.
- Write an R function called is_odd that returns TRUE if its input is an odd number and FALSE if it is even (you can assume that the input is an integer and not a decimal number). To do that, you can use the function %%, where a %% b is the remainder when a is divided by b. To think about oddness or evenness, consider the remainder when you divide by 2.
- Write an R function called hotpo1 35 that takes an integer as input and returns the next number in the Collatz sequence. To do this, use the function you just wrote that determines whether a number is even or odd.
- Now write a function hotpo that will return the whole Collatz sequence for an input \(x_1\). For this, assume that you will eventually get to 1.
- Write two (very small) functions that take an entire sequence as input and return (i) the length of the sequence and (ii) the maximum value it attains.
- Make a data frame consisting of the values 11 through 20, and, using tidyverse ideas, obtain a data frame containing the Collatz sequences starting at each of those values, along with their lengths and their maximum values. Which sequence is longest? Which one goes up highest?
22.3 Coefficient of Variation
The coefficient of variation of a vector x is defined as the standard deviation of x divided by the mean of x.
- Write a function called cv that calculates the coefficient of variation of its input and returns the result. You should use base R’s functions that reliably compute the pieces that you need.
- Use your function to find the coefficient of variation of the set of integers 1 through 5.
- Define a vector as follows:

v <- c(-2.8, -1.8, -0.8, 1.2, 4.2)

What is its coefficient of variation, according to your function? Does this make sense? Why did this happen? Explain briefly.
- Most people only calculate a coefficient of variation if there are no negative numbers. Rewrite your function so that it gives an error if there are any negative numbers in the input, and test it with the vector v above. Hint: you might need to add error=TRUE to your chunk header to allow your document to preview/knit (inside the curly brackets at the top of the chunk, after a comma).
22.4 Rescaling
Suppose we have this vector of values:
z <- c(10, 14, 11)
z
## [1] 10 14 11
We want to scale these so that the smallest value is 0 and the largest is 1. We are going to be doing this a lot, so we are going to write a function that will work for any input.
- Using a copy of my z, work out min(z) and max(z). What do they do? Explain (very) briefly.
- What do these lines of code do, using the same z that I had? Run them and see, and describe briefly what s contains.

lo <- min(z)
hi <- max(z)
s <- (z - lo) / (hi - lo)
s

- Write a function called rescale that implements the calculation above, for any input vector called x. (Note that I changed the name.)
- Test your function on my z, and on another vector of your choosing. Explain briefly why the answer you get from your vector makes sense.
- What happens if your input to rescale is a vector of numbers all the same? Give an example. Rewrite your function to intercept this case and give a helpful error message.
- Make a dataframe (containing any numeric values), and in it create a new column containing the rescaled version of one of its columns, using your function. Show your result.
- We might want to rescale the input not to be between 0 and 1, but between two values a and b that we specify as input. If a and/or b are not given, we want to use the values 0 for a and 1 for b. Rewrite your function to rescale the input to be between a and b instead of 0 and 1. Hint: allow your function to produce values between 0 and 1 as before, and then note that if all the values in a vector s are between 0 and 1, then all the values in a+(b-a)*s are between \(a\) and \(b\).
- Test your new function two or more times, on input where you know or can guess what the output is going to be. In each case, explain briefly why your output makes sense.
My solutions follow:
22.5 Making some R functions
Let’s write some simple R functions to convert temperatures, and later to play with text.
- A temperature in Celsius is converted to one in Kelvin by
adding 273.15. (A temperature of \(-273.15\) Celsius, 0 Kelvin, is the “absolute zero”
temperature that nothing can be colder than.) Write a
function called
c_to_k
that converts an input Celsius temperature to one in Kelvin, and test that it works.
Solution
This is mostly an exercise in structuring your function
correctly. Let’s call the input C
(uppercase C, since
lowercase c has a special meaning to R):
c_to_k <- function(C) {
  C + 273.15
}
c_to_k(0)
## [1] 273.15
c_to_k(20)
## [1] 293.15
This is the simplest way to do it: the last line of the function, if calculated but not saved, is the value that gets returned to the outside world. The checks suggest that it worked.
If you’re used to Python or similar, you might prefer to calculate the value to be returned and then return it. You can do that in R too:
c_to_k <- function(C) {
  K <- C + 273.15
  return(K)
}
c_to_k(0)
## [1] 273.15
c_to_k(20)
## [1] 293.15
That works just as well, and for the rest of this question, you can go either way.36
\(\blacksquare\)
- Write a function to convert a Fahrenheit temperature to Celsius. The way you do that is to subtract 32 and then multiply by \(5/9\).
Solution
On the model of the previous one, we should call this
f_to_c
. I’m going to return the last line, but you can
save the calculated value and return that instead:
f_to_c <- function(F) {
  (F - 32) * 5 / 9
}
f_to_c(32)
## [1] 0
f_to_c(50)
## [1] 10
f_to_c(68)
## [1] 20
Americans are very good at saying things like “temperatures in the 50s”, which don’t mean much to me, so I like to have benchmarks to work with: these are the Fahrenheit versions of 0, 10, and 20 Celsius.
Thus “in the 50s” means “between about 10 and 15 Celsius”.
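As a quick check on those benchmarks (this little snippet is my own addition, not part of the original solution), the two ends of “the 50s” convert to exactly that Celsius range:

f_to_c(c(50, 59))   # 50 and 59 Fahrenheit are exactly 10 and 15 Celsius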
\(\blacksquare\)
- Using the functions you already wrote, write a function to convert an input Fahrenheit temperature to Kelvin.
Solution
This implies that you can piggy-back on the functions you just wrote, which goes as below. First you convert the Fahrenheit to Celsius, and then you convert that to Kelvin. (This is less error-prone than trying to use algebra to get a formula for this conversion and then implementing that):
f_to_k <- function(F) {
  C <- f_to_c(F)
  K <- c_to_k(C)
  return(K)
}
f_to_k(32)
## [1] 273.15
f_to_k(68)
## [1] 293.15
These check because in Celsius they are 0 and 20 and we found the Kelvin equivalents of those to be these values earlier.
I wrote this one with a return
because I thought it made the
structure clearer: run one function, save the result, run another
function, save the result, then return what you’ve got.
\(\blacksquare\)
- Rewrite your Fahrenheit-to-Celsius convertor to take a suitable default value and check that it works as a default.
Solution
You can choose any default you like. I’ll take a default of 68 (what I would call “a nice day”):
f_to_c <- function(F = 68) {
  (F - 32) * 5 / 9
}
f_to_c(68)
## [1] 20
f_to_c()
## [1] 20
The change is in the top line of the function. You see the result: if we run it without an input, we get the same answer as if the input had been 68.
\(\blacksquare\)
- What happens if you feed your Fahrenheit-to-Celsius convertor a vector of Fahrenheit temperatures? What if you use it in a mutate?
Solution
Try it and see:
temps <- seq(30, 80, 10)
temps
## [1] 30 40 50 60 70 80
f_to_c(temps)
## [1] -1.111111 4.444444 10.000000 15.555556 21.111111 26.666667
Each of the Fahrenheit temperatures gets converted into a Celsius one. This is perhaps more useful in a data frame, thus:
tibble(temps = seq(30, 80, 10)) %>%
mutate(celsius = f_to_c(temps))
## # A tibble: 6 × 2
## temps celsius
## <dbl> <dbl>
## 1 30 -1.11
## 2 40 4.44
## 3 50 10
## 4 60 15.6
## 5 70 21.1
## 6 80 26.7
All the temperatures are side-by-side with their equivalents.
Here’s another way to do the above:
temps <- seq(30, 80, 10)
temps %>%
  enframe(value = "fahrenheit") %>%
  mutate(celsius = f_to_c(temps))
## # A tibble: 6 × 3
## name fahrenheit celsius
## <int> <dbl> <dbl>
## 1 1 30 -1.11
## 2 2 40 4.44
## 3 3 50 10
## 4 4 60 15.6
## 5 5 70 21.1
## 6 6 80 26.7
enframe creates a two-column data frame out of a vector (like temps). A vector can have “names”, in which case they’ll be used as the name column; the values will go in a column called value unless you rename it, as I did.
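Here is a small illustration of my own (the vector temps2 and its names are invented for this purpose) of where the name column comes from when the vector does have names:

temps2 <- c(cold = 30, mild = 50, warm = 70)
temps2 %>%
  enframe(value = "fahrenheit") %>%
  mutate(celsius = f_to_c(fahrenheit))

The names cold, mild and warm go into the name column, rather than the row numbers you saw above.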
\(\blacksquare\)
- Write another function called wrap that takes two arguments: a piece of text called text, which defaults to hello, and another piece of text called outside, which defaults to *. The function returns text with the text outside placed before and after, so that calling the function with the defaults should return *hello*. To do this, you can use str_c from stringr (loaded with the tidyverse) which places its text arguments side by side and glues them together into one piece of text. Test your function briefly.
Solution
This:
wrap <- function(text = "hello", outside = "*") {
  str_c(outside, text, outside)
}
I can run this with the defaults:
wrap()
## [1] "*hello*"
or with text of my choosing:
wrap("goodbye", "_")
## [1] "_goodbye_"
I think that’s what I meant by “test briefly”.
\(\blacksquare\)
- What happens if you want to change the default outside but use the default for text? How do you make sure that happens? Explore.
Solution
The obvious thing is this, which doesn’t work:
wrap("!")
## [1] "*!*"
This takes text
to be !
, and outside
to be the default. How do we get outside
to be !
instead? The key is to specify the input by name:
wrap(outside = "!")
## [1] "!hello!"
This correctly uses the default for text
.
If you specify inputs without names, they are
taken to be in the order that they appear in the function
definition. As soon as they get out of order, which typically happens
by using the default for something early in the list, as we did here
for text
, you have to specify names for anything that comes
after that. These are the names you put on the function’s top line.
You can always use names:
wrap(text = "thing", outside = "**")
## [1] "**thing**"
and if you use names, they don’t even have to be in order:
wrap(outside = "!?", text = "fred")
## [1] "!?fred!?"
\(\blacksquare\)
- What happens if you feed your function wrap a vector for either of its arguments? What about if you use it in a mutate?
Solution
Let’s try:
mytext <- c("a", "b", "c")
wrap(text = mytext)
## [1] "*a*" "*b*" "*c*"
myout <- c("*", "!")
wrap(outside = myout)
## [1] "*hello*" "!hello!"
If one of the inputs is a vector, the other one gets “recycled” as many times as the vector is long. What if they’re both vectors?
mytext2 <- c("a", "b", "c", "d")
wrap(mytext2, myout)
## [1] "*a*" "!b!" "*c*" "!d!"
This uses the two inputs in parallel, repeating the short one as needed. But this, though it works, gives a warning:
wrap(mytext, myout)
## Warning in stri_c(..., sep = sep, collapse = collapse, ignore_null = TRUE):
## longer object length is not a multiple of shorter object length
## [1] "*a*" "!b!" "*c*"
This is because the shorter vector (of length 2 here) doesn’t go evenly into the longer one (length 3). It gives a warning because this is probably not what you wanted.
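If the mismatched lengths are what you actually want, one way to silence the warning (my own suggestion, not part of the original solution) is to do the recycling yourself with rep before calling wrap:

# recycle myout by hand so that both inputs have length 3
wrap(mytext, rep(myout, length.out = length(mytext)))

This gives the same alternating pattern as before, but the recycling is now stated explicitly in the code, so there is nothing left to warn about.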
The mutate
thing is easier, because all the columns in a data
frame have to be the same length. LETTERS
is a vector with
the uppercase letters in it:
tibble(mytext = LETTERS[1:6], myout = c("*", "**", "!", "!!", "_", "__")) %>%
mutate(newthing = wrap(mytext, myout))
## # A tibble: 6 × 3
## mytext myout newthing
## <chr> <chr> <chr>
## 1 A * *A*
## 2 B ** **B**
## 3 C ! !C!
## 4 D !! !!D!!
## 5 E _ _E_
## 6 F __ __F__
\(\blacksquare\)
22.6 The Collatz sequence
The Collatz sequence is a sequence of integers \(x_1, x_2, \ldots\) defined in a deceptively simple way: if \(x_n\) is the current term of the sequence, then \(x_{n+1}\) is defined as \(x_n/2\) if \(x_n\) is even, and \(3x_n+1\) if \(x_n\) is odd. We are interested in understanding how this sequence behaves; for example, what happens to it as \(n\) gets large, for different choices of the first term \(x_1\)? We will explore this numerically with R; the ambitious among you might like to look into the mathematics of it.
- What happens to the sequence when it reaches 4? What would be a sensible way of defining where it ends? Explain briefly.
Solution
When the sequence reaches 4 (that is, when its current term is 4), the next term is 2 and the one after that is 1. Then the following term is 4 again (\((3 \times 1)+1\)) and then it repeats indefinitely, \(4, 2, 1, 4, 2, 1, \ldots\). I think a sensible way to define where the sequence ends is to say “when it reaches 1”, since if you start at 2 you’ll never reach 4 (so “when it reaches 4” won’t work), and it seems at least plausible that it will hit the cycle 4, 2, 1 sometime.
\(\blacksquare\)
- Write an R function called is_odd that returns TRUE if its input is an odd number and FALSE if it is even (you can assume that the input is an integer and not a decimal number). To do that, you can use the function %%, where a %% b is the remainder when a is divided by b. To think about oddness or evenness, consider the remainder when you divide by 2.
Solution
Let’s try this out. For example, 5 is odd and 6 is even, so
5 %% 2
## [1] 1
6 %% 2
## [1] 0
When a number is odd, its remainder on dividing by 2 is 1, and when
even, the remainder is 0. There is an additional shortcut here in that
1 is the numeric value of TRUE
and 0 of FALSE
, so
all we have to do is calculate the remainder on dividing by 2, turn it
into a logical
, and return it:
is_odd <- function(x) {
  r <- x %% 2
  as.logical(r)
}
You probably haven’t seen as.logical
before, but it’s the
same idea as as.numeric
: turn something that looks like a
TRUE
or FALSE
into something that actually
is.
We should test it:
is_odd(19)
## [1] TRUE
is_odd(12)
## [1] FALSE
is_odd(0)
## [1] FALSE
0 is usually considered an even number, so this is good.
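An equally good way to write this, offered as my own alternative sketch rather than as the official solution, is to compare the remainder with 1, since the comparison is itself already TRUE or FALSE:

# alternative: return the result of the comparison directly
is_odd2 <- function(x) {
  x %% 2 == 1
}
is_odd2(19)   # TRUE
is_odd2(12)   # FALSE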
\(\blacksquare\)
- Write an R function called hotpo1 37 that takes an integer as input and returns the next number in the Collatz sequence. To do this, use the function you just wrote that determines whether a number is even or odd.
Solution
The logic is “if the input is odd, return 3 times it plus 1, otherwise return half of it”.
The R structure is an if-then-else:
hotpo1 <- function(x) {
  if (is_odd(x)) 3 * x + 1 else x / 2
}
In R, the condition that is tested goes in brackets, and then if the
value-if-true and the value-if-false are single statements, you just
type them. (If they are more complicated than that, you put them in
curly brackets.) Now you see the value of writing is_odd
earlier; this code almost looks like the English-language description
of the sequence. If we had not written is_odd
before, the
condition would have looked something like
if (x %% 2 == 1) 3 * x + 1 else x / 2
which would have been a lot harder to read.
All right, let’s try that out:
hotpo1(4)
## [1] 2
hotpo1(7)
## [1] 22
hotpo1(24)
## [1] 12
That looks all right so far.
\(\blacksquare\)
- Now write a function
hotpo
that will return the whole Collatz sequence for an input \(x_1\). For this, assume that you will eventually get to 1.
Solution
This is a loop, but not a for
loop (or something that we
could do rowwise
), because we don’t know how many
times we have to go around. This is the kind of thing that we
should use a while
loop for:
“keep going while a condition is true”.
In this case, we should keep going if we haven’t reached 1 yet. If
we haven’t reached 1, we should generate the next value of the
sequence and glue it onto what we have so far. To initialize the
sequence, we start with the input value. There is an R trick to
glue a value onto the sequence, which is to use c
with a
vector and a value, and save it back into the vector:
hotpo <- function(x) {
  sequence <- x
  term <- x
  while (term > 1) {
    term <- hotpo1(term)
    sequence <- c(sequence, term)
  }
  sequence
}
I use term
to hold the current term of the sequence, and
overwrite it with the next one (since I don’t need the old one any
more).
Does it work?
hotpo(4)
## [1] 4 2 1
hotpo(12)
## [1] 12 6 3 10 5 16 8 4 2 1
hotpo(97)
## [1] 97 292 146 73 220 110 55 166 83 250 125 376 188 94 47
## [16] 142 71 214 107 322 161 484 242 121 364 182 91 274 137 412
## [31] 206 103 310 155 466 233 700 350 175 526 263 790 395 1186 593
## [46] 1780 890 445 1336 668 334 167 502 251 754 377 1132 566 283 850
## [61] 425 1276 638 319 958 479 1438 719 2158 1079 3238 1619 4858 2429 7288
## [76] 3644 1822 911 2734 1367 4102 2051 6154 3077 9232 4616 2308 1154 577 1732
## [91] 866 433 1300 650 325 976 488 244 122 61 184 92 46 23 70
## [106] 35 106 53 160 80 40 20 10 5 16 8 4 2 1
97 is a wild ride, but it does eventually get to 1.
Extra: where I originally saw this, which was “Metamagical Themas” by Douglas Hofstadter, he was illustrating the programming language Lisp and the process of recursion, whereby you define a function in terms of itself. This one is a natural for that, because the Collatz sequence starting at \(x\) is \(x\) along with the Collatz sequence starting at the next term. For example, if you start at 12, the next term is 6, so that the Collatz sequence starting at 12 is 12 followed by the Collatz sequence starting at 6. There is no dependence any further back. You can do recursion in R also; there is no problem with a function calling itself:
hotpo_rec <- function(x) {
  if (x == 1) 1 else c(x, hotpo_rec(hotpo1(x)))
}
Recursive functions have two parts: a “base case” that says how you know you are done (the 1 here), and a “recursion” that says how you move to a simpler case, here working out the next term, getting the whole sequence for that, and gluing the input onto the front. It seems paradoxical that you define a function in terms of itself, but what you are doing is calling a simpler sequence, in this case one that is length one shorter than the sequence for the original input. Thus, we hope,38 we will eventually reach 1.
Does it work?
hotpo_rec(12)
## [1] 12 6 3 10 5 16 8 4 2 1
hotpo_rec(97)
## [1] 97 292 146 73 220 110 55 166 83 250 125 376 188 94 47
## [16] 142 71 214 107 322 161 484 242 121 364 182 91 274 137 412
## [31] 206 103 310 155 466 233 700 350 175 526 263 790 395 1186 593
## [46] 1780 890 445 1336 668 334 167 502 251 754 377 1132 566 283 850
## [61] 425 1276 638 319 958 479 1438 719 2158 1079 3238 1619 4858 2429 7288
## [76] 3644 1822 911 2734 1367 4102 2051 6154 3077 9232 4616 2308 1154 577 1732
## [91] 866 433 1300 650 325 976 488 244 122 61 184 92 46 23 70
## [106] 35 106 53 160 80 40 20 10 5 16 8 4 2 1
It does.
Recursive functions are often simple to understand, but they are not
always very efficient. They can take a lot of memory, because they
have to handle the intermediate calls to the function, which they have
to save to use later (in the case of hotpo_rec(97)
there are a
lot of those). Recursive functions are often paired with a technique
called “memoization”, where each time you calculate the function’s
value, you save it in another array. The first thing you do in
the recursive function is to check whether you already have the
answer, in which case you just look it up and return it. It was a lot
of work here to calculate the sequence from 97, but if we had saved
the results, we would already have the answers for 292, 146, 73, 220
and so on, and getting those later would be a table lookup rather than
another recursive calculation.
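As a concrete sketch of that idea (mine, not from the original text; the name hotpo_memo and the use of an environment as the lookup table are my own choices), here is a memoized version of the recursive function:

# memoized Collatz sequence: every starting value met along the way gets its
# whole sequence stored in cache, so asking for it again is just a lookup
hotpo_memo <- local({
  cache <- new.env()
  function(x) {
    key <- as.character(x)
    if (exists(key, envir = cache, inherits = FALSE)) {
      return(get(key, envir = cache))      # already computed: look it up
    }
    ans <- if (x == 1) 1 else c(x, hotpo_memo(hotpo1(x)))
    assign(key, ans, envir = cache)        # save for next time
    ans
  }
})
hotpo_memo(97)    # computed the long way, caching 292, 146, 73, ... as it goes
hotpo_memo(292)   # now an instant lookup rather than another recursion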
\(\blacksquare\)
- Write two (very small) functions that take an entire sequence as input and return (i) the length of the sequence and (ii) the maximum value it attains.
Solution
These are both one-liners. Call the input whatever you like:
hotpo_len <- function(sequence) length(sequence)
hotpo_max <- function(sequence) max(sequence)
Because they are one-liners, you don’t even need the curly brackets, although there’s no problem if they are there.
Testing:
hotpo_len(hotpo(12))
## [1] 10
hotpo_max(hotpo(97))
## [1] 9232
This checks with what we had before.
\(\blacksquare\)
- Make a data frame consisting of the values 11 through 20, and,
using
tidyverse
ideas, obtain a data frame containing the Collatz sequences starting at each of those values, along with their lengths and their maximum values. Which sequence is longest? Which one goes up highest?
Solution
This one uses rowwise
ideas:39
tibble(x = 11:20) %>%
  rowwise() %>%
  mutate(sequence = list(hotpo(x))) %>%
  mutate(length = hotpo_len(sequence)) %>%
  mutate(high = hotpo_max(sequence))
## # A tibble: 10 × 4
## # Rowwise:
## x sequence length high
## <int> <list> <int> <dbl>
## 1 11 <dbl [15]> 15 52
## 2 12 <dbl [10]> 10 16
## 3 13 <dbl [10]> 10 40
## 4 14 <dbl [18]> 18 52
## 5 15 <dbl [18]> 18 160
## 6 16 <dbl [5]> 5 16
## 7 17 <dbl [13]> 13 52
## 8 18 <dbl [21]> 21 52
## 9 19 <dbl [21]> 21 88
## 10 20 <dbl [8]> 8 20
First, we obtain a list-column containing the sequences (which is why its calculation needs a list
around it), then two
ordinary columns of their lengths and their maximum values.
The sequences for 18 and 19 are the longest, but the sequence for 15 goes up the highest.
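To pick those winners out programmatically rather than by eye (my own follow-up; the ungroup and filter steps are not part of the original solution), undo the rowwise grouping and then filter on the column maxima:

tibble(x = 11:20) %>%
  rowwise() %>%
  mutate(sequence = list(hotpo(x))) %>%
  mutate(length = hotpo_len(sequence)) %>%
  mutate(high = hotpo_max(sequence)) %>%
  ungroup() %>%    # so that max() below is over all ten rows, not within each row
  filter(length == max(length) | high == max(high))

This should keep just the rows for 15 (the highest maximum), and 18 and 19 (the longest sequences).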
\(\blacksquare\)
22.7 Coefficient of Variation
The coefficient of variation of a vector x is defined as the standard deviation of x divided by the mean of x.
- Write a function called
cv
that calculates the coefficient of variation of its input and returns the result. You should use base R’s functions that reliably compute the pieces that you need.
Solution
I like to make a function skeleton first to be sure I remember all the pieces. You can do this by making a code chunk, typing fun, waiting a moment, and selecting the “snippet” that is offered to you. That gives you this, with your cursor on name:
name <- function(variables) {

}
Give your function a name (I asked you to use cv
), hit tab, enter a name like x for the input, hit tab again, and then fill in the body of the function, to end up with something like this:
cv <- function(x) {
  mean_x <- mean(x)
  sd_x <- sd(x)
  sd_x / mean_x
}
I think this is best, for reasons discussed below.
Some observations:
- in the last line of the function, you calculate something, and that something is the thing that gets returned. You do not need to save it in a variable, because then you have to return that variable by having its name alone on the last line.
- R has a
return()
function, but it is bad style to use it to return a value at the end of a function. The time to use it is when you test something early in the function and want to return early with a value like zero or missing without doing any actual calculation. Looking ahead a bit, you might want to return “missing” if the mean is zero before dividing by it, which you could do like this, this being good style:
cv2 <- function(x) {
  mean_x <- mean(x)
  if (mean_x == 0) return(NA)
  sd_x <- sd(x)
  sd_x / mean_x
}
- use the built-in mean and sd rather than trying to roll your own, because they have been tested by many users over a long time and they work. That’s what I meant by “pieces” in the question.
- you might be tempted to do something like this:
cv3 <- function(x) {
  sd(x) / mean(x)
}
This will work, but it is not a good habit to get into, and thus not best as an answer to this question. This one line of code says three things: “work out the SD of x
”, “work out the mean of x
”, and “divide them by each other”. It is much clearer, for anyone else reading the code (including yourself in six months when you have forgotten what you were thinking) to have the three things one per line, so that anyone reading the code sees that each line does one thing and what that one thing is. There is a reason why production code and code golf are two very different things. Code that is easy to read is also easy to maintain, either by you or others.
- using something other than mean and sd as the names of your intermediate results is a good idea because these names are already used by R (as the names of the functions that compute the mean and SD). Redefining names that R uses can make other code behave unpredictably and cause hard-to-find bugs.40
\(\blacksquare\)
- Use your function to find the coefficient of variation of the set of integers 1 through 5.
Solution
This can be as simple as
cv(1:5)
## [1] 0.5270463
or, equally good, define this into a vector first:
y <- 1:5
cv(y)
## [1] 0.5270463
or, displaying a little less understanding, type the five numbers into the vector first (or directly as input to the function):
y <- c(1, 2, 3, 4, 5)
cv(y)
## [1] 0.5270463
cv(c(1, 2, 3, 4, 5))
## [1] 0.5270463
\(\blacksquare\)
- Define a vector as follows:
v <- c(-2.8, -1.8, -0.8, 1.2, 4.2)
What is its coefficient of variation, according to your function? Does this make sense? Why did this happen? Explain briefly.
Solution
Try it and see:
cv(v)
## [1] 6.248491e+16
A very large number, much bigger than any of the data values; with these human-sized numbers, we’d expect a human-sized coefficient of variation as well.
What actually happened was that the mean of v
is this:
mean(v)
## [1] 4.440892e-17
Zero, or close enough, so that in calculating the coefficient of variation, we divided by something that was (almost) zero and got a result that was (almost) infinite.41
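If you want to see the “close enough to zero” part explicitly (this check is my own addition; see also the footnote about machine epsilon), compare the mean with R’s machine epsilon:

mean(v)                              # tiny, but not exactly zero
.Machine$double.eps                  # machine epsilon, about 2.2e-16
abs(mean(v)) < .Machine$double.eps   # the mean is below it: effectively zero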
The possibility of getting a zero mean is why most people only calculate a coefficient of variation if all of the numbers are positive, which brings us to the next part:
\(\blacksquare\)
- Most people only calculate a coefficient of variation if there are no negative numbers. Rewrite your function so that it gives an error if there are any negative numbers in the input, and test it with the vector v above. Hint: you might need to add error=TRUE to your chunk header to allow your document to preview/knit (inside the curly brackets at the top of the chunk, after a comma).
Solution
This is a case for stopifnot
, or of if
coupled with stop
. Check this up front, as the first thing you do before you calculate anything else. As to what to check, there are several possibilities:
- stop if any of the numbers are negative
- continue if all of the numbers are positive
- stop if the smallest number is negative (the smallest number is negative if and only if not all the numbers are positive)
R has functions any
and all
that do what you’d expect:
w <- 1:5
w
## [1] 1 2 3 4 5
any(w>3.5)
## [1] TRUE
all(w<4.5)
## [1] FALSE
Are there any numbers greater than 3.5 (yes, 4 and 5); are all the numbers less than 4.5 (no, 5 isn’t).
Cite your sources for these if you use either of them, since this is the first place in the course that I’m mentioning either of them.
Remember that if you use stopifnot
, the condition that goes in there is what has to be true if the function is to run; if you use if
and stop
, the condition is what will stop the function running. With that in mind, I would code my three possibilities above this way. First off, here’s the original:
cv <- function(x) {
  mean_x <- mean(x)
  sd_x <- sd(x)
  sd_x / mean_x
}
then, stop if any of the numbers are negative:
cv <- function(x) {
  if (any(x < 0)) stop("A value is negative")
  mean_x <- mean(x)
  sd_x <- sd(x)
  sd_x / mean_x
}
cv(v)
## Error in cv(v): A value is negative
continue if all the numbers are positive
cv <- function(x) {
  stopifnot(all(x > 0))
  mean_x <- mean(x)
  sd_x <- sd(x)
  sd_x / mean_x
}
cv(v)
## Error in cv(v): all(x > 0) is not TRUE
stop if the smallest value is negative
cv <- function(x) {
  if (min(x) < 0) stop("Smallest value is negative")
  mean_x <- mean(x)
  sd_x <- sd(x)
  sd_x / mean_x
}
cv(v)
## Error in cv(v): Smallest value is negative
There are (at least) three other possibilities: you can negate the logical condition and interchange if/stop
and stopifnot
, thus (at the expense of some clarity of reading):
continue if it is not true that any of the numbers are negative
cv <- function(x) {
  stopifnot(!any(x < 0))
  mean_x <- mean(x)
  sd_x <- sd(x)
  sd_x / mean_x
}
cv(v)
## Error in cv(v): !any(x < 0) is not TRUE
(you might be thinking of De Morgan’s laws here)
stop if it is not true that all the numbers are positive
cv <- function(x) {
  if (!all(x > 0)) stop("Not all values are positive")
  mean_x <- mean(x)
  sd_x <- sd(x)
  sd_x / mean_x
}
cv(v)
## Error in cv(v): Not all values are positive
continue if the smallest value is not negative
cv <- function(x) {
  stopifnot(min(x) >= 0)
  mean_x <- mean(x)
  sd_x <- sd(x)
  sd_x / mean_x
}
cv(v)
## Error in cv(v): min(x) >= 0 is not TRUE
or another way to do the last one, a more direct negation of the condition, which at my guess needs some extra brackets:
cv <- function(x) {
  stopifnot(!(min(x) < 0))
  mean_x <- mean(x)
  sd_x <- sd(x)
  sd_x / mean_x
}
cv(v)
## Error in cv(v): !(min(x) < 0) is not TRUE
This one is hard to parse: what does that last message mean? I would take a negative off each side and read it as “min of x is negative is TRUE”, but that takes extra effort.
I said that last one needed some extra brackets. This is, I thought, to get the order of operations right (operator precedence); it turns out not to matter because “not” has lower precedence than most other things, so that these do actually work (the “not” is evaluated after the less-than and the other things, so last of all here, even though it appears to be “glued” to the min
):
!min(v)<0
## [1] FALSE
!min(1:5)<0
## [1] TRUE
See this for details. See especially the second set of examples, the ones beginning with “Special operators”, and see especially-especially the comment at the bottom of these examples! That is to say, you should put in the extra brackets unless you also make the case that they are not needed, because anyone reading your code is guaranteed to be confused by it when they read it (including you in six months, because you will not remember the operator priority of “not”).
My take is that one of the first three of the seven possibilities for coding stopifnot
or if
with stop
is the best, since these more obviously encode the condition for continuing or stopping as appropriate. There are two things here: one is that you have to get the code right, but the second is that you have to get the code clear, so that it is obvious to anyone reading it that it does the right thing (once again, this includes you in six months). On that score, the first three alternatives are a direct expression of what you want to achieve, and the last four make it look as if you found a way of coding it that worked and stopped there, without thinking about whether there were any other, clearer or more expressive, possibilities.
\(\blacksquare\)
22.8 Rescaling
Suppose we have this vector of values:
z <- c(10, 14, 11)
z
## [1] 10 14 11
We want to scale these so that the smallest value is 0 and the largest is 1. We are going to be doing this a lot, so we are going to write a function that will work for any input.
- Using a copy of my z, work out min(z) and max(z). What do they do? Explain (very) briefly.
Solution
Simply this – define z
first:
z <- c(10, 14, 11)
min(z)
## [1] 10
max(z)
## [1] 14
They are respectively (and as you would guess) the smallest and largest of the values in z
. (A nice gentle warmup, but I wanted to get you on the right track for what is coming up.)
\(\blacksquare\)
- What do these lines of code do, using the same z that I had? Run them and see, and describe briefly what s contains.
lo <- min(z)
hi <- max(z)
s <- (z - lo) / (hi - lo)
s
Solution
Here we go:
lo <- min(z)
hi <- max(z)
s <- (z - lo) / (hi - lo)
s
## [1] 0.00 1.00 0.25
The lowest value became 0, the highest 1, and the other one became something in between. Saying this shows the greatest insight.
Extra: the reason for doing it in three steps rather than one (see below) is (i) it makes it clearer what is going on (and thus makes it less likely that you make a mistake), and (ii) it is more efficient, since my way only finds the minimum once instead of twice. Compare this approach with mine above:
(z - min(z)) / (max(z) - min(z))
## [1] 0.00 1.00 0.25
More complicated with all the brackets, and two min(z). Admittedly the difference here will be thousandths of a second, but why call a function twice when you don’t have to?
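A related option, offered as my own note rather than part of the original discussion, is range, which returns the minimum and maximum together in one call:

r <- range(z)                # c(min, max), here 10 and 14
(z - r[1]) / (r[2] - r[1])   # the same rescaled values as before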
\(\blacksquare\)
- Write a function called rescale that implements the calculation above, for any input vector called x. (Note that I changed the name.)
Solution
Write a function skeleton:
rescale <- function(x) {

}
and inside the curly brackets put the code from above, replacing z
with x
everywhere:
rescale <- function(x) {
  lo <- min(x)
  hi <- max(x)
  (x - lo) / (hi - lo)
}
You’ll need to make sure your function returns something to the outside world. Either don’t save the last line in s
(as I did here), or save it in s
and then return s
:
rescale <- function(x) {
  lo <- min(x)
  hi <- max(x)
  s <- (x - lo) / (hi - lo)
  s
}
or use return()
if you must, but be aware that this is bad style in R (unlike Python, where you need it). The approved way of using return
in R is when you are returning something earlier than the last line of a function, for example, you are testing a simple case first and returning the value that goes with that, before getting down to the serious computation.
Extra: in the spirit of what’s coming up below, you might check first whether the maximum and minimum are the same and return something else if that’s the case:
rescale0 <- function(x) {
  lo <- min(x)
  hi <- max(x)
  if (lo == hi) return(0)
  s <- (x - lo) / (hi - lo)
  s
}
This is good style; in this case, if lo
and hi
are the same, we want to return something else (zero) to the outside world, and then we do the calculation, knowing that lo
and hi
are different, so that we are sure we are not dividing by zero (but I get ahead of myself).
Doing it this way, there is something to be careful of: a function ought to return a predictable type of thing: numbers, in this case. If you have your function return text on error, like this:
rescale0 <- function(x) {
  lo <- min(x)
  hi <- max(x)
  if (lo == hi) return("High and low need to be different")
  s <- (x - lo) / (hi - lo)
  s
}
then you can get into trouble:
rescale0(z)
## [1] 0.00 1.00 0.25
rescale0(c(3,3,3))
## [1] "High and low need to be different"
The first one is numbers and the second one is text.
\(\blacksquare\)
- Test your function on my
z
, and on another vector of your choosing. Explain briefly why the answer you get from your vector makes sense.
Solution
On my z
:
rescale(z)
## [1] 0.00 1.00 0.25
The same values as I got before, so that works.
For your vector, use whatever you like. I think it makes sense to have the values already in order, to make it easier to check. Here’s one possibility:
w <- 2:6
w
## [1] 2 3 4 5 6
and then
rescale(w)
## [1] 0.00 0.25 0.50 0.75 1.00
The smallest value is 2, which goes to zero; the largest is 6, which goes to 1, and the others are equally spaced between in both the input and the output.
Another possibility is to use a vector with values whose largest and smallest you can clearly see:
w <- c(10, 11, 100, 0, 20)
rescale(w)
## [1] 0.10 0.11 1.00 0.00 0.20
Clearly the smallest value is 0 and the largest is 100. These become 0 and 1, and these particular values make it easy to see what happened: each of the other values got divided by 100.
Some discussion is needed here, in that you need to say something convincing about why your answer is right.
Extra: This is why I had you use a name other than z
for the input to your function. The function can be used on any input, not just the z
that we tested it on. There’s another R-specific reason, which is that you need to be careful about using the named inputs only. Consider this function:
ff <- function(x) {
  x + z
}
ff(10)
## [1] 20 24 21
Where did z
come from? R used the z
we had before, which is rather dangerous: what if we had a z
lying around from some completely different work? Much better to have a function work with only inputs in the top line:
ff <- function(x, z) {
  x + z
}
ff(10, 3)
## [1] 13
ff(10)
## Error in ff(10): argument "z" is missing, with no default
The first time, the two inputs are added together, but the second time it tells you it was expecting a value to use for z
and didn’t see one. Much safer.
\(\blacksquare\)
- What happens if your input to
rescale
is a vector of numbers all the same? Give an example. Rewrite your function to intercept this case and give a helpful error message.
Solution
First, try it and see. Any collection of values all the same will do:
rescale(c(3,3,3))
## [1] NaN NaN NaN
NaN stands for “not a number”. The way we got it is that the minimum and maximum were the same, so our function ended up dividing by zero (in fact, working out zero divided by zero). This is, in R terms, not even an error, but the answer is certainly not helpful.
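For the record (my own aside), dividing by zero is not an error in R either way, but the two cases look different:

(3 - 3) / (3 - 3)   # 0/0 gives NaN, which is what rescale produced here
1 / 0               # a nonzero number over 0 gives Inf instead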
The easiest way to check inputs is to use stopifnot
to express what should be true if the function is to proceed. Here, we want the maximum and minimum to be different, so:
rescale <- function(x) {
  lo <- min(x)
  hi <- max(x)
  stopifnot(hi != lo)
  (x - lo) / (hi - lo)
}
rescale(c(3,3,3))
## Error in rescale(c(3, 3, 3)): hi != lo is not TRUE
This is much clearer: I only have to recall what my hi
and lo
are to see what the problem is.
Extra 1: by calculating and saving the min and max up front, I still only need to calculate them once. If you do it this way:
rescale <- function(x) {
  stopifnot(max(x) != min(x))
  (x - min(x)) / (max(x) - min(x))
}
rescale(c(3,3,3))
## Error in rescale(c(3, 3, 3)): max(x) != min(x) is not TRUE
you get a slightly more informative error message, but you have calculated the max twice and the min three times for no reason.
Extra 2: stopifnot
is shorthand for this:
rescale <- function(x) {
  lo <- min(x)
  hi <- max(x)
  if (hi == lo) stop("min and max are the same!")
  (x - lo) / (hi - lo)
}
rescale(c(3,3,3))
## Error in rescale(c(3, 3, 3)): min and max are the same!
I didn’t show you this, so if you use stop
, you must tell me where you found out about it. This is better than returning some text (see rescale0
above) or printing a message: it’s an error, so you want to make it look like an error. I am very sympathetic to being persuaded that this is better than stopifnot
, because you can customize the message (and, also, you don’t have to go through the double-negative contortions of stopifnot
). Another way to use stopifnot
and get a customized message is this one (that I only learned about right when you were writing this Assignment):
rescale <- function(x) {
  lo <- min(x)
  hi <- max(x)
  stopifnot("high and low must be different" = (hi != lo))
  (x - lo) / (hi - lo)
}
rescale(c(3,3,3))
## Error in rescale(c(3, 3, 3)): high and low must be different
This is called a “named argument”, and the name, if given, is used as an error message.
Extra 3: returning to my rescale0
from above:
rescale0
## function(x) {
## lo <- min(x)
## hi <- max(x)
## if (lo == hi) return("High and low need to be different")
## s <- (x - lo) / (hi - lo)
## s
## }
## <bytecode: 0x560e93136430>
this can get you into trouble if you use it in a dataframe. This is a bit complicated, since it has to use list-columns. Here we go:
tibble(x = list(z, c(3,3,3)))
## # A tibble: 2 × 1
## x
## <list>
## 1 <dbl [3]>
## 2 <dbl [3]>
Just to check that this does contain what you think it does:
tibble(x = list(z, c(3,3,3))) %>% unnest(x)
## # A tibble: 6 × 1
## x
## <dbl>
## 1 10
## 2 14
## 3 11
## 4 3
## 5 3
## 6 3
So now, for each of those two input vectors, what happens when we run rescale0
on them? This is rowwise
:
tibble(x = list(z, c(3,3,3))) %>%
rowwise() %>%
mutate(ans = list(rescale0(x)))
## # A tibble: 2 × 2
## # Rowwise:
## x ans
## <list> <list>
## 1 <dbl [3]> <dbl [3]>
## 2 <dbl [3]> <chr [1]>
The first ans
is a vector of 3 numbers, and the second one is one piece of text (the “error message”). I was actually surprised it got this far. So what happens when we unnest
the second column?
tibble(x = list(z, c(3,3,3))) %>%
rowwise() %>%
mutate(ans = list(rescale0(x))) %>%
unnest(ans)
## Error in `list_unchop()`:
## ! Can't combine `x[[1]]` <double> and `x[[2]]` <character>.
Now we get a confusing error: it’s here that combining some numbers and some text in one column of a dataframe doesn’t work. To forestall this, we need to go back and rewrite rescale0
to not mix things up. Having it return an error, as the latest version of rescale
does, gives an error here too, but at least we know what it means:
tibble(x = list(z, c(3,3,3))) %>%
rowwise() %>%
mutate(ans = list(rescale(x))) %>%
unnest(ans)
## Error in `mutate()`:
## ! Problem while computing `ans = list(rescale(x))`.
## ℹ The error occurred in row 2.
## Caused by error in `rescale()`:
## ! high and low must be different
because this is the error we anticipated: it says “somewhere within the list-column x
, specifically in its second row, is a vector where everything is the same”.
\(\blacksquare\)
- Make a dataframe (containing any numeric values), and in it create a new column containing the rescaled version of one of its columns, using your function. Show your result.
Solution
This is less difficult than you might be expecting: make a dataframe with at least one numeric column, and use mutate
:
d <- tibble(y=2:6)
d
## # A tibble: 5 × 1
## y
## <int>
## 1 2
## 2 3
## 3 4
## 4 5
## 5 6
and then
d %>% mutate(s=rescale(y))
## # A tibble: 5 × 2
## y s
## <int> <dbl>
## 1 2 0
## 2 3 0.25
## 3 4 0.5
## 4 5 0.75
## 5 6 1
You can supply the values for what I called y
, or use random numbers. It’s easier for you to check that it has worked if your column playing the role of my y
has not too many values in it.
Extra: this is actually already in the tidyverse
under the name percent_rank
(“percentile ranks”):
d %>% mutate(s = percent_rank(y))
## # A tibble: 5 × 2
## y s
## <int> <dbl>
## 1 2 0
## 2 3 0.25
## 3 4 0.5
## 4 5 0.75
## 5 6 1
The value 5, for example, is at the 75th percentile. Note, though, that percent_rank works with the ranks of the values rather than the values themselves, so it only agrees with rescale when the values are equally spaced, as they are here.
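To see the difference (my own example; the dataframe d2 is invented for this), try values that are not equally spaced:

d2 <- tibble(y = c(0, 10, 11, 20, 100))
d2 %>% mutate(s = rescale(y),        # uses the actual values: 0, 0.10, 0.11, 0.20, 1
              pr = percent_rank(y))  # uses only their order: 0, 0.25, 0.50, 0.75, 1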
\(\blacksquare\)
- We might want to rescale the input not to be between 0 and 1, but between two values a and b that we specify as input. If a and/or b are not given, we want to use the values 0 for a and 1 for b. Rewrite your function to rescale the input to be between a and b instead of 0 and 1. Hint: allow your function to produce values between 0 and 1 as before, and then note that if all the values in a vector s are between 0 and 1, then all the values in a+(b-a)*s are between \(a\) and \(b\).
- Test your new function two or more times, on input where you know or can guess what the output is going to be. In each case, explain briefly why your output makes sense.
Solution
I’m showing you my thought process in this one. The answer I want from you is the one at the end.
So, start by copying and pasting what you had before:
rescale <- function(x) {
  lo <- min(x)
  hi <- max(x)
  stopifnot(hi != lo)
  (x - lo) / (hi - lo)
}
On the top line, add the extra inputs and their default values. I also changed the name of my function, for reasons you’ll see later:
rescale2 <- function(x, a=0, b=1) {
  lo <- min(x)
  hi <- max(x)
  stopifnot(hi != lo)
  (x - lo) / (hi - lo)
}
Save the last line, since we have to do something else with it:
rescale2 <- function(x, a=0, b=1) {
  lo <- min(x)
  hi <- max(x)
  stopifnot(hi != lo)
  s <- (x - lo) / (hi - lo)
}
and finally add the calculation in the hint, which we don’t need to save because we are returning it:
rescale2 <- function(x, a=0, b=1) {
  lo <- min(x)
  hi <- max(x)
  stopifnot(hi != lo)
  s <- (x - lo) / (hi - lo)
  a + (b-a) * s
}
This complete function is what I want to see from you. (You should keep the stopifnot
, because this function will have the exact same problem as the previous one if all the values in x
are the same.)
A better way is to observe that you can call functions inside functions. The function above is now a bit messy since it has several steps. Something that corresponds better to my hint is to call the original rescale
first, and then modify its result:
rescale3 <- function(x, a=0, b=1) {
  s <- rescale(x)
  a + (b-a) * s
}
The logic to this is rather clearly “rescale the input to be between 0 and 1, then rescale that to be between \(a\) and \(b\).” My rescale2
does exactly the same thing, but it’s much less clear that it does so, unless you happen to have in your head how rescale
works. (I think you are more likely to remember, sometime in the future, what rescale
does, compared to precisely how it works.)
That is why rescale3
is better than rescale2
. Remember that you can, and generally should, use functions that have already been written (by you or someone else) as part of functions that do more complex things. See also my second point below.
Extra: there are two important principles about why functions are important:
- they allow you to re-do a calculation on many different inputs (the point I’ve been making up to now)
- by abstracting a calculation into a thing with a name, it makes it easier to understand that calculation’s role in something bigger. The only thing we had to remember in
rescale3
is what the last line did, because the name of the function called on the first line tells us what happens there. This is much easier than remembering what the first four lines ofrescale2
do.
The second principle here is what psychologists call “chunking”: you view a thing like my function rescale
as a single item, rather than as four separate lines of code, and then that single item can be part of something larger (like my rescale3
), and you have a smaller number of things to keep track of.
\(\blacksquare\)
- Test your new function two or more times, on input where you know or can guess what the output is going to be. In each case, explain briefly why your output makes sense.
Solution
I’ll start by using the default values for a
and b
(so I don’t have to specify them):
rescale2(2:6)
## [1] 0.00 0.25 0.50 0.75 1.00
rescale3(2:6)
## [1] 0.00 0.25 0.50 0.75 1.00
I did both of the variants of my function; of course, you’ll only have one variant.
We got the same answer as before for the same input, so the default values \(a=0, b=1\) look as if they have been used.
Let’s try a different one:
v <- c(7, 11, 12)
rescale2(v, 10, 30)
## [1] 10 26 30
rescale3(v, 10, 30)
## [1] 10 26 30
The lowest value in v
has become 10, and the highest has become 30. (Also, the in-between value 11 was closer to 12 than to 7, and it has become something closer to 30 than to 10.)
Extra: you can also name any of your inputs:
rescale3(x=v, a=10, b=30)
## [1] 10 26 30
and if you name them, you can shuffle the order too:
rescale3(a=10, b=30, x=v)
## [1] 10 26 30
The point of doing more than one test is to check that different aspects of your function all work. Therefore, the best testing here checks that the defaults work, and that the answer is sensible for some different a
and b
(to check that this works as well).
When you write your version of rescale with the optional inputs, it’s best if you do it so that the thing you have to supply (the vector of numbers) comes first. If you put a and b first, when you want to omit them, you’ll have to call the input vector by name, like this:
rescale3(x=v)
## [1] 0.0 0.8 1.0
because otherwise the input vector will be taken to be a
, not what you want.
\(\blacksquare\)
Hotpo is short for half or triple-plus-one.↩︎
R style is to use the last line of the function for the return value, unless you are jumping out of the function before the end, in which case use return.↩︎
Hotpo is short for half or triple-plus-one.↩︎
Nobody knows whether you always get to 1, but also nobody has ever found a case where you don’t. Collatz’s conjecture, that you will get to 1 eventually, is known to be true for all starting \(x_1\) up to some absurdly large number, but not for all starting points.↩︎
I should have been more careful in my functions to make sure everything was integers, and, in particular, to do integer division by 2 because I knew that this division was going to come out even.↩︎
This is something I’ve done in the past and been bitten by, so I am trying to get myself not to do it any more.↩︎
In computing with decimal numbers, things are almost never exactly zero or exactly infinite; they are very small or very big. The mean here, which you would calculate to be zero, is less than the so-called machine epsilon, which is about 10 to the minus 16 in R (R works in double precision). A mean that small is, to the computer, indistinguishable from zero. It came out that way because the last value is added to a total that is by that point negative, and so you have a loss of accuracy because of subtracting nearly equal quantities. I learned all about this stuff in my first real computer science course, which I think was a numerical math course, some absurd number of years ago. It looks as if this gets taught in CSCC37 these days.↩︎