A brief look at some of what’s new in Purrr 1.0
map
The square root function is vectorized:
sqrt(1:10)
[1] 1.000000 1.414214 1.732051 2.000000 2.236068 2.449490 2.645751
[8] 2.828427 3.000000 3.162278
so let’s make ourselves work harder by defining one that is not:
sqrt1 <- function(x) sqrt(x[1])
sqrt1(1:10)
[1] 1
How can we use sqrt1
to calculate the square roots of all of the numbers 1 through 10? This is what map
and friends from purrr
are for.
There are now three ways to use map. First, the traditional way, passing the function by name:
1:10 %>% map_dbl(sqrt1)
[1] 1.000000 1.414214 1.732051 2.000000 2.236068 2.449490 2.645751
[8] 2.828427 3.000000 3.162278
I never liked this because the thing I was for-eaching over had to be the first input of the function, and then you have to add any further arguments after the first one separately. For example, if you want base 10 logs1 of a bunch of numbers:2
1:10 %>% map_dbl(log, 10)
[1] 0.0000000 0.3010300 0.4771213 0.6020600 0.6989700 0.7781513
[7] 0.8450980 0.9030900 0.9542425 1.0000000
These examples use map_dbl
because sqrt1
and log
return a decimal number or dbl
.
This approach would be awkward if you wanted to compute, let’s say, the log of 10 to different bases:
log_base <- function(x) log(10, x)
base <- c(2, exp(1), 10) # the second one is e
base %>% map_dbl(log_base)
[1] 3.321928 2.302585 1.000000
I had to define a helper function with the thing to be for-eached over as its first argument.
Historically, this notation comes from the apply
family of functions. In this case:
sapply(1:10, log, 10)
[1] 0.0000000 0.3010300 0.4771213 0.6020600 0.6989700 0.7781513
[7] 0.8450980 0.9030900 0.9542425 1.0000000
Second, the way I came to prefer (which I will now have to unlearn, see below) is this:
1:10 %>% map_dbl(~sqrt1(.))
[1] 1.000000 1.414214 1.732051 2.000000 2.236068 2.449490 2.645751
[8] 2.828427 3.000000 3.162278
I would read this to myself in English as “for each thing in 1 through 10, work out the square root of it”, where ~
was read as “work out” and .
(or .x
if you prefer) was read as “it”.
You can also create a new column of a dataframe this way:
tibble(x = 1:10) %>% 
  mutate(root = map_dbl(x, ~sqrt1(.)))
# A tibble: 10 × 2
x root
<int> <dbl>
1 1 1
2 2 1.41
3 3 1.73
4 4 2
5 5 2.24
6 6 2.45
7 7 2.65
8 8 2.83
9 9 3
10 10 3.16
This is a little odd, for learners,
because the thing inside the sqrt1
is crying out to be called x
. I still think this is all right: “for each thing in x
, work out the square root of it”, in the same way that you would use i
as a loop index in a for loop.
The log examples both work more smoothly this way:
1:10 %>% map_dbl(~log(., 10))
[1] 0.0000000 0.3010300 0.4771213 0.6020600 0.6989700 0.7781513
[7] 0.8450980 0.9030900 0.9542425 1.0000000
and
base %>% map_dbl(~log(10, .))
[1] 3.321928 2.302585 1.000000
without the need to handle additional inputs specially, and without the requirement to have the “it” be the first input to the function. The call to the function looks exactly the same as it does when you call it outside a map
, which makes it easier to learn.
A third way of specifying what to “work out” is to use the new (in R 4.1) shorthand notation for an “anonymous function”: a function, typically a one-liner, defined inline without a name. This is how it goes:
1:10 %>% map_dbl(\(x) sqrt1(x))
[1] 1.000000 1.414214 1.732051 2.000000 2.236068 2.449490 2.645751
[8] 2.828427 3.000000 3.162278
This one, to my mind, is not any clearer than the “work out” notation with a squiggle, though you can still cast your eyes over it and read “for each thing in 1 through 10, work out the square root of it” with a bit of practice.
This notation wins where the input things have names:3
number <- 1:10
map_dbl(number, \(number) sqrt1(number))
[1] 1.000000 1.414214 1.732051 2.000000 2.236068 2.449490 2.645751
[8] 2.828427 3.000000 3.162278
And thus also in defining new columns of a dataframe:
tibble(x = 1:10) %>% 
  mutate(root = map_dbl(x, \(x) sqrt1(x)))
# A tibble: 10 × 2
x root
<int> <dbl>
1 1 1
2 2 1.41
3 3 1.73
4 4 2
5 5 2.24
6 6 2.45
7 7 2.65
8 8 2.83
9 9 3
10 10 3.16
The clarity comes from the ability to use the name of the input column also as the name of the input to the anonymous function, so that everything joins up: “for each thing in x
, work out the square root of that x
”.4
This also works if you are for-eaching over two columns, for example working out logs of different numbers to different bases:
x <- 2:4
base
[1] 2.000000 2.718282 10.000000
crossing
(from tidyr
) makes a dataframe out of all combinations of its inputs, and so:
crossing(x = x, base = base) %>% 
  mutate(log_of = map2_dbl(x, base, \(x, base) log(x, base)))
# A tibble: 9 × 3
x base log_of
<int> <dbl> <dbl>
1 2 2 1
2 2 2.72 0.693
3 2 10 0.301
4 3 2 1.58
5 3 2.72 1.10
6 3 10 0.477
7 4 2 2
8 4 2.72 1.39
9 4 10 0.602
This doesn’t only apply to making dataframe columns, but again works nicely any time the input things have names:
u <- 1:5
v <- 11:15
map2_dbl(u, v, \(u, v) sqrt1(u+v))
[1] 3.464102 3.741657 4.000000 4.242641 4.472136
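If you have more than two parallel inputs, the same idea extends to purrr’s pmap family; a minimal sketch (the vectors here are made up for illustration):

```r
library(purrr)
# three parallel vectors, combined elementwise;
# the anonymous function receives one element from each
pmap_dbl(list(1:3, 4:6, 7:9), \(a, b, c) a + b + c)
# [1] 12 15 18
```

As with map2, the anonymous-function names line up with the inputs, so the call reads naturally.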
When I am teaching this stuff, I say that if the thing you are working out is complicated, write a function to do that first, and then worry about for-eaching it. For example, imagine you want a function that takes an integer as input, and the output is half the input if the input is even, or three times the input plus one if the input is odd:
This is a bit long to put in the anonymous function of a map
, so we’ll define a function hotpo
to do it first:5
hotpo <- function(x) {
stopifnot(x == round(x)) # error out if input is not an integer
if (x %% 2 == 0) {
ans <- x %/% 2
} else {
ans <- 3 * x + 1
}
ans
}
hotpo(4)
[1] 2
hotpo(3)
[1] 10
hotpo(5.6)
Error in hotpo(5.6): x == round(x) is not TRUE
So now, we can use a map
to work out hotpo
of each of the numbers 1 through 6:
first <- 1:6
map_int(first, hotpo)
[1] 4 1 10 2 16 3
or
map_int(first, ~hotpo(.))
[1] 4 1 10 2 16 3
or
map_int(first, \(first) hotpo(first))
[1] 4 1 10 2 16 3
where we call our function in the anonymous function. The answer is the same whichever of these ways you use, and you can reasonably argue that the last one is the clearest because the inputs to the map_int
and the function have the same name.
This one is map_int
because hotpo
returns an integer.
This function is actually more than a random function defined on integers; it is part of an open problem in number theory called the Collatz conjecture. The idea is if you do this:
10
[1] 10
hotpo(10)
[1] 5
hotpo(hotpo(10))
[1] 16
hotpo(hotpo(hotpo(10)))
[1] 8
hotpo(hotpo(hotpo(hotpo(10))))
[1] 4
hotpo(hotpo(hotpo(hotpo(hotpo(10)))))
[1] 2
hotpo(hotpo(hotpo(hotpo(hotpo(hotpo(10))))))
[1] 1
you obtain a sequence of integers. If you ever get to 1, you’ll go back to 4, 2, 1, and loop forever, so we’ll say the sequence ends if it gets to 1. The Collatz conjecture says that, no matter where you start, you will always get to 1.6
Let’s assume that we are going to get to 1, and write a function to generate the whole sequence. The two key ingredients are: the hotpo
function we wrote, and a while
loop to keep going until we do get to 1:
hotpo_seq <- function(x) {
ans <- x
while(x != 1) {
x <- hotpo(x)
ans <- c(ans, x)
}
ans
}
and test it:
hotpo_seq(10)
[1] 10 5 16 8 4 2 1
the same short ride that we had above, and a rather longer one:
hotpo_seq(27)
[1] 27 82 41 124 62 31 94 47 142 71 214 107 322
[14] 161 484 242 121 364 182 91 274 137 412 206 103 310
[27] 155 466 233 700 350 175 526 263 790 395 1186 593 1780
[40] 890 445 1336 668 334 167 502 251 754 377 1132 566 283
[53] 850 425 1276 638 319 958 479 1438 719 2158 1079 3238 1619
[66] 4858 2429 7288 3644 1822 911 2734 1367 4102 2051 6154 3077 9232
[79] 4616 2308 1154 577 1732 866 433 1300 650 325 976 488 244
[92] 122 61 184 92 46 23 70 35 106 53 160 80 40
[105] 20 10 5 16 8 4 2 1
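Rather than counting entries by eye, we can ask R directly how long this sequence is and how high it goes (a quick check, reusing the hotpo_seq function defined above):

```r
# length of the Collatz sequence starting at 27, and its peak value
length(hotpo_seq(27))
# [1] 112
max(hotpo_seq(27))
# [1] 9232
```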
Now, let’s suppose that we want to make a dataframe with the sequences for the starting points 1 through 10. Each sequence is a vector rather than a single number, so we need to do this with map:7
tibble(start = 1:10) %>% 
  mutate(sequence = map(start, \(start) hotpo_seq(start)))
# A tibble: 10 × 2
start sequence
<int> <list>
1 1 <int [1]>
2 2 <dbl [2]>
3 3 <dbl [8]>
4 4 <dbl [3]>
5 5 <dbl [6]>
6 6 <dbl [9]>
7 7 <dbl [17]>
8 8 <dbl [4]>
9 9 <dbl [20]>
10 10 <dbl [7]>
and we have made a list-column. You can see by the lengths of the vectors in the list-column how long each sequence is.8 We might want to make explicit how long each sequence is, and how high it goes:
tibble(start = 1:10) %>%
mutate(sequence = map(start, \(start) hotpo_seq(start))) %>%
mutate(seq_len = map_int(sequence, \(sequence) length(sequence))) %>%
mutate(seq_max = map_int(sequence, \(sequence) max(sequence))) -> seq_info
seq_info
# A tibble: 10 × 4
start sequence seq_len seq_max
<int> <list> <int> <int>
1 1 <int [1]> 1 1
2 2 <dbl [2]> 2 2
3 3 <dbl [8]> 8 16
4 4 <dbl [3]> 3 4
5 5 <dbl [6]> 6 16
6 6 <dbl [9]> 9 16
7 7 <dbl [17]> 17 52
8 8 <dbl [4]> 4 8
9 9 <dbl [20]> 20 52
10 10 <dbl [7]> 7 16
To verify for a starting point of 7:
hotpo_seq(7)
[1]  7 22 11 34 17 52 26 13 40 20 10  5 16  8  4  2  1
This does indeed have a length of 17 and goes up as high as 52 before coming back down to 1.
We don’t have to make a dataframe of these (though that, these days, is usually my preferred way of working). We can instead put the sequences in a list
. This one is a “named list”, with each sequence paired with its starting point (its “name”):
seq_list <- seq_info$sequence
names(seq_list) <- seq_info$start
seq_list
$`1`
[1] 1
$`2`
[1] 2 1
$`3`
[1] 3 10 5 16 8 4 2 1
$`4`
[1] 4 2 1
$`5`
[1] 5 16 8 4 2 1
$`6`
[1] 6 3 10 5 16 8 4 2 1
$`7`
[1] 7 22 11 34 17 52 26 13 40 20 10 5 16 8 4 2 1
$`8`
[1] 8 4 2 1
$`9`
[1] 9 28 14 7 22 11 34 17 52 26 13 40 20 10 5 16 8 4 2 1
$`10`
[1] 10 5 16 8 4 2 1
If these were in a dataframe as above, a filter
would pick out the sequences for particular starting points. As an example, we will pick out the sequences for odd-numbered starting points.
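In the dataframe world, that would be a one-line filter; a sketch, using the seq_info dataframe built above:

```r
library(dplyr)
# keep only the rows whose starting point is odd
seq_info %>% 
  filter(start %% 2 == 1)
```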
Here, this allows us to learn about the new keep_at
and discard_at
.
There are already keep
and discard
,9 for selecting by value, but the new ones allow selecting by name.
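For instance, keep applies a predicate to each element’s value; a sketch, using the seq_list from above, that keeps only the sequences with more than five entries (the cutoff is arbitrary, for illustration):

```r
library(purrr)
# keep() looks at each element's value, not its name
seq_list %>% 
  keep(\(s) length(s) > 5)
```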
There are different ways to use keep_at
, but one is to write a function that accepts a name and returns TRUE
if that is one of the names you want to keep. Mine is below. The names are text, so I convert the name to an integer and then test it for oddness as we did in hotpo
:
# keep the sequences for odd-numbered starting points
is_odd <- function(x) {
  x <- as.integer(x)
  x %% 2 == 1
}
and now I keep the sequences that have odd starting points thus:
seq_list %>% 
  keep_at(\(x) is_odd(x))
$`1`
[1] 1
$`3`
[1] 3 10 5 16 8 4 2 1
$`5`
[1] 5 16 8 4 2 1
$`7`
[1] 7 22 11 34 17 52 26 13 40 20 10 5 16 8 4 2 1
$`9`
[1] 9 28 14 7 22 11 34 17 52 26 13 40 20 10 5 16 8 4 2 1
discard_at
selects the ones for which the helper function is FALSE
, which in this case will give us the even-numbered starting points:
seq_list %>%
discard_at(\(x) is_odd(x))
$`2`
[1] 2 1
$`4`
[1] 4 2 1
$`6`
[1] 6 3 10 5 16 8 4 2 1
$`8`
[1] 8 4 2 1
$`10`
[1] 10 5 16 8 4 2 1
I have long been a devotee of the lambda-function notation with a map
:
x <- 1:5
map_dbl(x, ~sqrt1(.))
[1] 1.000000 1.414214 1.732051 2.000000 2.236068
but I have always had vague misgivings about teaching this, because it is not immediately obvious why the thing inside sqrt1
is not also x
. The reason, of course, is the same as this in Python:
x = ['a', 'b', 'c']
for i in x:
    print(i)
a
b
c
where i
stands for “the element of x
that I am currently looking at”, but it takes a bit of thinking for the learner to get to that point.
Using the anonymous function approach makes things a bit clearer:
x <- 1:5
map_dbl(x, \(x) sqrt1(x))
[1] 1.000000 1.414214 1.732051 2.000000 2.236068
where x
appears three times in the map
, first as the vector of values of which we want the square roots, and then as the input to sqrt1
, so that everything appears to line up.
But there is some sleight of hand here: the meaning of x
actually changes as you go along! The first x
is a vector, but the second and third x
values are numbers, elements of the vector x
. Maybe this is all right, because we are used to treating vectors elementwise in R:
tibble(x = x) %>% 
  mutate(root = sqrt(x))
# A tibble: 5 × 2
x root
<int> <dbl>
1 1 1
2 2 1.41
3 3 1.73
4 4 2
5 5 2.24
Functions like sqrt
are vectorized, so the mutate
really means something like “take the elements of x
one at a time and take the square root of each one, gluing the result back together into a vector”. So, in the grand scheme of things, I am sold on the (new) anonymous function way of running map
, and I think I will be using this rather than the lambda-function way of doing things in the future.
Now, if you’ll excuse me, I have to attend to all the times I’ve used map
in my lecture notes!
R’s log
function has two arguments: the number whose log you want, and then the base of the log, which defaults to \(e\).↩︎
Ignoring the fact that log
is vectorized.↩︎
The logic here seems to require the vector to have a singular name.↩︎
The input to the anonymous function could be called anything, but it seems like a waste to not use the same name as the column being for-eached over.↩︎
%/%
is integer division, discarding the remainder, and %%
is the remainder itself. We need to be careful with the division because, for example, 4 / 2
is actually a decimal number, what we old FORTRAN programmers used to write as 2.0
or 2.
.↩︎
Spoiler: nobody has been able to prove that this is always true, but every starting point that has been tried gets to 1.↩︎
Using plain map
means that its output will be a list
, and in a dataframe will result in the new column being a list-column with something more than a single number stored in each cell.↩︎
I am a little bothered by most of them being dbl
rather than int
.↩︎
I must be having flashbacks of SAS, because I expected the opposite of “keep” to be “drop”.↩︎
For attribution, please cite this work as
Butler (2022, Dec. 23). Ken's Blog: Looking in on Purrr 1.0. Retrieved from http://ritsokiguess.site/blogg/posts/2022-12-23-looking-in-on-purrr-10/
BibTeX citation
@misc{butler2022looking,
  author = {Butler, Ken},
  title = {Ken's Blog: Looking in on Purrr 1.0},
  url = {http://ritsokiguess.site/blogg/posts/2022-12-23-looking-in-on-purrr-10/},
  year = {2022}
}