A brief look at some of what’s new in Purrr 1.0
map
The square root function is vectorized:
sqrt(1:10)
[1] 1.000000 1.414214 1.732051 2.000000 2.236068 2.449490 2.645751
[8] 2.828427 3.000000 3.162278
so let’s make ourselves work harder by defining one that is not:
sqrt1 <- function(x) sqrt(x[1])
sqrt1(1:10)
[1] 1
How can we use sqrt1
to calculate the square roots of all of the numbers 1 through 10? This is what map
and friends from purrr
are for.
There are now three ways to use map. First, the traditional way, passing the function by name:
1:10 %>% map_dbl(sqrt1)
[1] 1.000000 1.414214 1.732051 2.000000 2.236068 2.449490 2.645751
[8] 2.828427 3.000000 3.162278
I never liked this because the thing I was for-eaching over had to be the first input of the function, and then you have to add any further arguments after the first one separately. For example, if you want base 10 logs1 of a bunch of numbers:2
1:10 %>% map_dbl(log, 10)
[1] 0.0000000 0.3010300 0.4771213 0.6020600 0.6989700 0.7781513
[7] 0.8450980 0.9030900 0.9542425 1.0000000
These examples use map_dbl
because sqrt1
and log
return a decimal number or dbl
.
This approach would be awkward if you wanted to compute, let’s say, the log of 10 to different bases:
log_base <- function(x) log(10, x)
base <- c(2, exp(1), 10) # the second one is e
base %>% map_dbl(log_base)
[1] 3.321928 2.302585 1.000000
I had to define a helper function with the thing to be for-eached over as its first argument.
Historically, this notation comes from the apply
family of functions. In this case:
sapply(1:10, log, 10)
[1] 0.0000000 0.3010300 0.4771213 0.6020600 0.6989700 0.7781513
[7] 0.8450980 0.9030900 0.9542425 1.0000000
Second, the way I came to prefer (which I will now have to unlearn, see below) is this:
1:10 %>% map_dbl(~sqrt1(.))
[1] 1.000000 1.414214 1.732051 2.000000 2.236068 2.449490 2.645751
[8] 2.828427 3.000000 3.162278
I would read this to myself in English as “for each thing in 1 through 10, work out the square root of it”, where ~
was read as “work out” and .
(or .x
if you prefer) was read as “it”.
You can also create a new column of a dataframe this way:
tibble(x = 1:10) %>% 
  mutate(root = map_dbl(x, ~sqrt1(.)))
# A tibble: 10 × 2
x root
<int> <dbl>
1 1 1
2 2 1.41
3 3 1.73
4 4 2
5 5 2.24
6 6 2.45
7 7 2.65
8 8 2.83
9 9 3
10 10 3.16
This is a little odd, for learners,
because the thing inside the sqrt1
is crying out to be called x
. I still think this is all right: “for each thing in x
, work out the square root of it”, in the same way that you would use i
as a loop index in a for loop.
The log examples both work more smoothly this way:
1:10 %>% map_dbl(~log(., 10))
[1] 0.0000000 0.3010300 0.4771213 0.6020600 0.6989700 0.7781513
[7] 0.8450980 0.9030900 0.9542425 1.0000000
and
base %>% map_dbl(~log(10, .))
[1] 3.321928 2.302585 1.000000
without the need to handle additional inputs specially, and without the requirement to have the “it” be the first input to the function. The call to the function looks exactly the same as it does when you call it outside a map
, which makes it easier to learn.
A third way of specifying what to “work out” is to use the new (in R 4.1) shorthand notation for an “anonymous function”: a function, typically a one-liner, defined inline without a name. This is how it goes:
1:10 %>% map_dbl(\(x) sqrt1(x))
[1] 1.000000 1.414214 1.732051 2.000000 2.236068 2.449490 2.645751
[8] 2.828427 3.000000 3.162278
This one, to my mind, is not any clearer than the “work out” notation with a squiggle, though you can still cast your eyes over it and read “for each thing in 1 through 10, work out the square root of it” with a bit of practice.
This notation wins where the input things have names:3
number <- 1:10
map_dbl(number, \(number) sqrt1(number))
[1] 1.000000 1.414214 1.732051 2.000000 2.236068 2.449490 2.645751
[8] 2.828427 3.000000 3.162278
And thus also in defining new columns of a dataframe:
tibble(x = 1:10) %>% 
  mutate(root = map_dbl(x, \(x) sqrt1(x)))
# A tibble: 10 × 2
x root
<int> <dbl>
1 1 1
2 2 1.41
3 3 1.73
4 4 2
5 5 2.24
6 6 2.45
7 7 2.65
8 8 2.83
9 9 3
10 10 3.16
The clarity comes from the ability to use the name of the input column also as the name of the input to the anonymous function, so that everything joins up: “for each thing in x
, work out the square root of that x
”.4
This also works if you are for-eaching over two columns, for example working out logs of different numbers to different bases:
x <- 2:4
base
[1] 2.000000 2.718282 10.000000
crossing
(from tidyr
) makes a dataframe out of all combinations of its inputs, and so:
crossing(x = x, base = base) %>% 
  mutate(log_of = map2_dbl(x, base, \(x, base) log(x, base)))
# A tibble: 9 × 3
x base log_of
<int> <dbl> <dbl>
1 2 2 1
2 2 2.72 0.693
3 2 10 0.301
4 3 2 1.58
5 3 2.72 1.10
6 3 10 0.477
7 4 2 2
8 4 2.72 1.39
9 4 10 0.602
This doesn’t only apply to making dataframe columns, but again works nicely any time the input things have names:
u <- 1:5
v <- 11:15
map2_dbl(u, v, \(u, v) sqrt1(u+v))
[1] 3.464102 3.741657 4.000000 4.242641 4.472136
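If you have more than two parallel inputs, the same idea extends to purrr’s pmap family; a minimal sketch (the vectors here are made up for illustration):

```r
library(purrr)
# three parallel vectors, combined elementwise;
# the anonymous function receives one element from each
pmap_dbl(list(1:3, 4:6, 7:9), \(a, b, c) a + b + c)
# [1] 12 15 18
```

As with map2, the anonymous-function names line up with the inputs, so the call reads naturally.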
When I am teaching this stuff, I say that if the thing you are working out is complicated, write a function to do that first, and then worry about for-eaching it. For example, imagine you want a function that takes an integer as input, and the output is half the input if the input is even, or three times the input plus one if the input is odd:
This is a bit long to put in the anonymous function of a map
, so we’ll define a function hotpo
to do it first:5
hotpo <- function(x) {
stopifnot(x == round(x)) # error out if input is not an integer
if (x %% 2 == 0) {
ans <- x %/% 2
} else {
ans <- 3 * x + 1
}
ans
}
hotpo(4)
[1] 2
hotpo(3)
[1] 10
hotpo(5.6)
Error in hotpo(5.6): x == round(x) is not TRUE
So now, we can use a map
to work out hotpo
of each of the numbers 1 through 6:
first <- 1:6
map_int(first, hotpo)
[1] 4 1 10 2 16 3
or
map_int(first, ~hotpo(.))
[1] 4 1 10 2 16 3
or
map_int(first, \(first) hotpo(first))
[1] 4 1 10 2 16 3
where we call our function in the anonymous function. The answer is the same whichever of these ways you use, and you can reasonably argue that the last one is the clearest because the inputs to the map_int
and the function have the same name.
This one is map_int
because hotpo
returns an integer.
This function is actually more than a random function defined on integers; it is part of an open problem in number theory called the Collatz conjecture. The idea is if you do this:
10
[1] 10
hotpo(10)
[1] 5
hotpo(hotpo(10))
[1] 16
hotpo(hotpo(hotpo(10)))
[1] 8
hotpo(hotpo(hotpo(hotpo(10))))
[1] 4
hotpo(hotpo(hotpo(hotpo(hotpo(10)))))
[1] 2
hotpo(hotpo(hotpo(hotpo(hotpo(hotpo(10))))))
[1] 1
you obtain a sequence of integers. If you ever get to 1, you’ll go back to 4, 2, 1, and loop forever, so we’ll say the sequence ends if it gets to 1. The Collatz conjecture says that, no matter where you start, you will always get to 1.6
Let’s assume that we are going to get to 1, and write a function to generate the whole sequence. The two key ingredients are: the hotpo
function we wrote, and a while
loop to keep going until we do get to 1:
hotpo_seq <- function(x) {
ans <- x
while(x != 1) {
x <- hotpo(x)
ans <- c(ans, x)
}
ans
}
and test it:
hotpo_seq(10)
[1] 10 5 16 8 4 2 1
the same short ride that we had above, and a rather longer one:
hotpo_seq(27)
[1] 27 82 41 124 62 31 94 47 142 71 214 107 322
[14] 161 484 242 121 364 182 91 274 137 412 206 103 310
[27] 155 466 233 700 350 175 526 263 790 395 1186 593 1780
[40] 890 445 1336 668 334 167 502 251 754 377 1132 566 283
[53] 850 425 1276 638 319 958 479 1438 719 2158 1079 3238 1619
[66] 4858 2429 7288 3644 1822 911 2734 1367 4102 2051 6154 3077 9232
[79] 4616 2308 1154 577 1732 866 433 1300 650 325 976 488 244
[92] 122 61 184 92 46 23 70 35 106 53 160 80 40
[105] 20 10 5 16 8 4 2 1
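Rather than counting entries by eye, we can ask R directly how long this sequence is and how high it goes (a quick check, reusing the hotpo_seq function defined above):

```r
# length of the Collatz sequence starting at 27, and its peak value
length(hotpo_seq(27))
# [1] 112
max(hotpo_seq(27))
# [1] 9232
```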
Now, let’s suppose that we want to make a dataframe with the sequences for the starting points 1 through 10. Each sequence is a vector rather than a single number, so we need to do this with map:7
tibble(start = 1:10) %>% 
  mutate(sequence = map(start, \(start) hotpo_seq(start)))
# A tibble: 10 × 2
start sequence
<int> <list>
1 1 <int [1]>
2 2 <dbl [2]>
3 3 <dbl [8]>
4 4 <dbl [3]>
5 5 <dbl [6]>
6 6 <dbl [9]>
7 7 <dbl [17]>
8 8 <dbl [4]>
9 9 <dbl [20]>
10 10 <dbl [7]>
and we have made a list-column. You can see by the lengths of the vectors in the list-column how long each sequence is.8 We might want to make explicit how long each sequence is, and how high it goes:
tibble(start = 1:10) %>%
mutate(sequence = map(start, \(start) hotpo_seq(start))) %>%
mutate(seq_len = map_int(sequence, \(sequence) length(sequence))) %>%
mutate(seq_max = map_int(sequence, \(sequence) max(sequence))) -> seq_info
seq_info
# A tibble: 10 × 4
start sequence seq_len seq_max
<int> <list> <int> <int>
1 1 <int [1]> 1 1
2 2 <dbl [2]> 2 2
3 3 <dbl [8]> 8 16
4 4 <dbl [3]> 3 4
5 5 <dbl [6]> 6 16
6 6 <dbl [9]> 9 16
7 7 <dbl [17]> 17 52
8 8 <dbl [4]> 4 8
9 9 <dbl [20]> 20 52
10 10 <dbl [7]> 7 16
To verify for a starting point of 7:
hotpo_seq(7)
[1]  7 22 11 34 17 52 26 13 40 20 10  5 16  8  4  2  1
This does indeed have a length of 17 and goes up as high as 52 before coming back down to 1.
We don’t have to make a dataframe of these (though that, these days, is usually my preferred way of working). We can instead put the sequences in a list
. This one is a “named list”, with each sequence paired with its starting point (its “name”):
seq_list <- seq_info$sequence
names(seq_list) <- seq_info$start
seq_list
$`1`
[1] 1
$`2`
[1] 2 1
$`3`
[1] 3 10 5 16 8 4 2 1
$`4`
[1] 4 2 1
$`5`
[1] 5 16 8 4 2 1
$`6`
[1] 6 3 10 5 16 8 4 2 1
$`7`
[1] 7 22 11 34 17 52 26 13 40 20 10 5 16 8 4 2 1
$`8`
[1] 8 4 2 1
$`9`
[1] 9 28 14 7 22 11 34 17 52 26 13 40 20 10 5 16 8 4 2 1
$`10`
[1] 10 5 16 8 4 2 1
If these were in a dataframe as above, a filter
would pick out the sequences for particular starting points. As an example, we will pick out the sequences for odd-numbered starting points.
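In the dataframe world, that would be a one-line filter; a sketch, using the seq_info dataframe built above:

```r
library(dplyr)
# keep only the rows whose starting point is odd
seq_info %>% 
  filter(start %% 2 == 1)
```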
Here, this allows us to learn about the new keep_at
and discard_at
.
There are already keep
and discard
,9 for selecting by value, but the new ones allow selecting by name.
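For instance, keep applies a predicate to each element’s value; a sketch, using the seq_list from above, that keeps only the sequences with more than five entries (the cutoff is arbitrary, for illustration):

```r
library(purrr)
# keep() looks at each element's value, not its name
seq_list %>% 
  keep(\(s) length(s) > 5)
```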
There are different ways to use keep_at
, but one is to write a function that accepts a name and returns TRUE
if that is one of the names you want to keep. Mine is below. The names are text, so I convert the name to an integer and then test it for oddness as we did in hotpo
:
# keep the sequences for odd-numbered starting points
is_odd <- function(x) {
  x <- as.integer(x)
  x %% 2 == 1
}
and now I keep the sequences that have odd starting points thus:
seq_list %>% 
  keep_at(\(x) is_odd(x))
$`1`
[1] 1
$`3`
[1] 3 10 5 16 8 4 2 1
$`5`
[1] 5 16 8 4 2 1
$`7`
[1] 7 22 11 34 17 52 26 13 40 20 10 5 16 8 4 2 1
$`9`
[1] 9 28 14 7 22 11 34 17 52 26 13 40 20 10 5 16 8 4 2 1
discard_at
selects the ones for which the helper function is FALSE
, which in this case will give us the even-numbered starting points:
seq_list %>%
discard_at(\(x) is_odd(x))
$`2`
[1] 2 1
$`4`
[1] 4 2 1
$`6`
[1] 6 3 10 5 16 8 4 2 1
$`8`
[1] 8 4 2 1
$`10`
[1] 10 5 16 8 4 2 1
I have long been a devotee of the lambda-function notation with a map
:
x <- 1:5
map_dbl(x, ~sqrt1(.))
[1] 1.000000 1.414214 1.732051 2.000000 2.236068
but I have always had vague misgivings about teaching this, because it is not immediately obvious why the thing inside sqrt1
is not also x
. The reason, of course, is the same as this in Python:
x = ['a', 'b', 'c']
for i in x:
    print(i)
a
b
c
where i
stands for “the element of x
that I am currently looking at”, but it takes a bit of thinking for the learner to get to that point.
Using the anonymous function approach makes things a bit clearer:
x <- 1:5
map_dbl(x, \(x) sqrt1(x))
[1] 1.000000 1.414214 1.732051 2.000000 2.236068
where x
appears three times in the map
, first as the vector of values of which we want the square roots, and then as the input to sqrt1
, so that everything appears to line up.
But there is some sleight of hand here: the meaning of x
actually changes as you go along! The first x
is a vector, but the second and third x
values are numbers, elements of the vector x
. Maybe this is all right, because we are used to treating vectors elementwise in R:
tibble(x = x) %>% 
  mutate(root = sqrt(x))
# A tibble: 5 × 2
x root
<int> <dbl>
1 1 1
2 2 1.41
3 3 1.73
4 4 2
5 5 2.24
Functions like sqrt
are vectorized, so the mutate
really means something like “take the elements of x
one at a time and take the square root of each one, gluing the result back together into a vector”. So, in the grand scheme of things, I am sold on the (new) anonymous function way of running map
, and I think I will be using this rather than the lambda-function way of doing things in the future.
Now, if you’ll excuse me, I have to attend to all the times I’ve used map
in my lecture notes!
R’s log
function has two arguments: the number whose log you want, and then the base of the log, which defaults to \(e\).↩︎
Ignoring the fact that log
is vectorized.↩︎
The logic here seems to require the vector to have a singular name.↩︎
The input to the anonymous function could be called anything, but it seems like a waste to not use the same name as the column being for-eached over.↩︎
%/%
is integer division, discarding the remainder, and %%
is the remainder itself. We need to be careful with the division because, for example, 4 / 2
is actually a decimal number, what we old FORTRAN programmers used to write as 2.0
or 2.
.↩︎
Spoiler: nobody has been able to prove that this is always true, but every starting point that has been tried gets to 1.↩︎
Using plain map
means that its output will be a list
, and in a dataframe will result in the new column being a list-column with something more than a single number stored in each cell.↩︎
I am a little bothered by most of them being dbl
rather than int
.↩︎
I must be having flashbacks of SAS, because I expected the opposite of “keep” to be “drop”.↩︎
For attribution, please cite this work as
Butler (2022, Dec. 23). Ken's Blog: Looking in on Purrr 1.0. Retrieved from http://ritsokiguess.site/blogg/posts/2022-12-23-looking-in-on-purrr-10/
BibTeX citation
@misc{butler2022looking,
  author = {Butler, Ken},
  title = {Ken's Blog: Looking in on Purrr 1.0},
  url = {http://ritsokiguess.site/blogg/posts/2022-12-23-looking-in-on-purrr-10/},
  year = {2022}
}