Changing a lot of things in a lot of places

Making a lot of changes in text, all in one go

Ken Butler http://ritsokiguess.site/blogg
2019-05-12

Packages
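
Everything in this post uses tribble, the pipe, mutate_at, str_c and str_replace_all, so loading the tidyverse should be enough. That this is the only package needed is my assumption, based on the functions that appear below:

```r
# Assumption: the tidyverse supplies everything used in this post
# (tibble for tribble, dplyr for mutate_at, stringr for str_c and str_replace_all)
library(tidyverse)
```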

Introduction

Let’s suppose you have a data frame like this:

d
# A tibble: 5 × 3
  x1       x2    y     
  <chr>    <chr> <chr> 
1 one      two   two   
2 four     three four  
3 seven    nine  eight 
4 six      eight seven 
5 fourteen nine  twelve
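
If you want to follow along, here is one way to make a data frame like this. It is a sketch that copies the values from the printout above; the use of tribble here is my choice, not necessarily how d was originally made:

```r
library(tibble)

# Reconstruction of d, with the values copied from the printout above
d <- tribble(
  ~x1,        ~x2,     ~y,
  "one",      "two",   "two",
  "four",     "three", "four",
  "seven",    "nine",  "eight",
  "six",      "eight", "seven",
  "fourteen", "nine",  "twelve"
)
d
```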

What you want to do is to change all the even numbers that are written out as words in columns x1 and x2, but not y, into digits, so that, for example, eight becomes 8. This would seem to be a job for str_replace_all, but how to manage the multitude of changes?

Making a lot of changes with str_replace_all

I learned today that you can feed str_replace_all a named vector. Wossat, you say? Well, one of these:

quantile(1:7)
  0%  25%  50%  75% 100% 
 1.0  2.5  4.0  5.5  7.0 

The numbers here are the five-number summary; the labels above them, which say which percentile each one is, are the names attribute. You can make one of these yourself like this:

x <- 1:3
x
[1] 1 2 3
names(x) <- c("first", "second", "third")
x
 first second  third 
     1      2      3 
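
As an aside, base R has a couple of shortcuts that build the same kind of thing in one step. This is a small sketch; v1 and v2 are names made up for the illustration and are not used elsewhere in this post:

```r
# Give the names directly inside c() ...
v1 <- c(first = 1L, second = 2L, third = 3L)

# ... or attach them to an existing vector with setNames()
v2 <- setNames(1:3, c("first", "second", "third"))

# Both carry the same values and the same names attribute
identical(v1, v2)
names(v1)

# The names can also be used to pull out elements
v1["second"]
```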

The value of this for us is that you can feed the boatload of potential changes into str_replace_all by feeding it a named vector of the changes it might make.
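
Before doing it for real, here is a tiny sketch of the idea on a single piece of text. The name recipe is made up for this illustration; the names of the vector are the patterns to look for, and the values are what to put in their place:

```r
library(stringr)

# Names are the patterns, values are the replacements
recipe <- c("one" = "1", "three" = "3")

# Both replacements happen in a single call; "two" is not in the recipe, so it stays
str_replace_all("one two three", recipe)
```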

In our example, we wanted to replace the even numbers by the numeric versions of themselves, so let’s make a little data frame with all of those:

changes <- tribble(
  ~from, ~to,
  "two", "2",
  "four", "4",
  "six", "6",
  "eight", "8",
  "ten", "10",
  "twelve", "12",
  "fourteen", "14"
)

I think this is as high as we need to go. I like a tribble for this so that you can easily see what is going to replace what.

For the named vector, the values are the new values (the ones I called to in changes), while the names are the old ones (from). So let’s construct that. There is one extra thing: I want to replace whole words only (and not end up with something like 4teen, which sounds like one of those 90s boy bands), so what I’ll do is to put “word boundaries”¹ around the from values:²

my_changes <- changes$to
names(my_changes) <- str_c("\\b", changes$from, "\\b")
my_changes
     \\btwo\\b     \\bfour\\b      \\bsix\\b    \\beight\\b 
           "2"            "4"            "6"            "8" 
     \\bten\\b   \\btwelve\\b \\bfourteen\\b 
          "10"           "12"           "14" 

and that seems to reflect the changes we want to make. So let’s make it go, just on columns x1 and x2:³

d %>% mutate_at(
  vars(starts_with("x")),
  ~ str_replace_all(., my_changes)
)
# A tibble: 5 × 3
  x1    x2    y     
  <chr> <chr> <chr> 
1 one   2     two   
2 4     three four  
3 seven nine  eight 
4 6     8     seven 
5 14    nine  twelve

Read this as: “for each of the columns whose name starts with x, replace everything in it that’s in the recipe in my_changes.”
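
For reference, here is a sketch of how the same step might look with across, which needs dplyr 1.0.0 or later. It rebuilds d and my_changes so that it runs on its own; result is a name made up for the illustration:

```r
library(tidyverse)

# The data frame and the table of changes, as in the post
d <- tribble(
  ~x1,        ~x2,     ~y,
  "one",      "two",   "two",
  "four",     "three", "four",
  "seven",    "nine",  "eight",
  "six",      "eight", "seven",
  "fourteen", "nine",  "twelve"
)
changes <- tribble(
  ~from,      ~to,
  "two",      "2",
  "four",     "4",
  "six",      "6",
  "eight",    "8",
  "ten",      "10",
  "twelve",   "12",
  "fourteen", "14"
)

# Named vector: names are the whole-word patterns, values are the replacements
my_changes <- changes$to
names(my_changes) <- str_c("\\b", changes$from, "\\b")

# across() applies the same function to every column chosen by starts_with("x")
result <- d %>%
  mutate(across(starts_with("x"), ~ str_replace_all(.x, my_changes)))
result
```

The column selection and the replacement function are the same as before; only the wrapper around them has changed.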

It seems to have worked, and not a 90s boy band in sight.
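
To see where the boy band would have come from, here is a small sketch comparing patterns with and without the word boundaries. The vectors words, no_boundaries and with_boundaries are made up for this illustration:

```r
library(stringr)

words <- c("four", "fourteen", "four-six")

# Without word boundaries, "four" also matches the start of "fourteen",
# and by the time the "fourteen" pattern gets its turn there is nothing left to match
no_boundaries <- c("four" = "4", "six" = "6", "fourteen" = "14")
str_replace_all(words, no_boundaries)
# [1] "4"     "4teen" "4-6"

# With \\b on each side, only whole words are replaced;
# a hyphen counts as a non-word character, so "four-six" still changes
with_boundaries <- c("\\bfour\\b" = "4", "\\bsix\\b" = "6", "\\bfourteen\\b" = "14")
str_replace_all(words, with_boundaries)
# [1] "4"   "14"  "4-6"
```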


  1. This Stack Overflow answer explains why the backslashes need to be doubled. The answer is for Python, but the same issue applies to R.

  2. This means that the number names only match if they are surrounded by non-word characters, such as spaces or punctuation, or by the beginning or end of the text.

  3. The modern way to do this is to use across, but I wrote this post in 2019, and this is all we had then.

Citation

For attribution, please cite this work as

Butler (2019, May 12). Ken's Blog: Changing a lot of things in a lot of places. Retrieved from http://ritsokiguess.site/blogg/posts/2021-11-19-changing-a-lot-of-things-in-a-lot-of-places/

BibTeX citation

@misc{butler2019changing,
  author = {Butler, Ken},
  title = {Ken's Blog: Changing a lot of things in a lot of places},
  url = {http://ritsokiguess.site/blogg/posts/2021-11-19-changing-a-lot-of-things-in-a-lot-of-places/},
  year = {2019}
}