Making a lot of changes in text, all in one go
Let’s suppose you have a data frame like this:
d
# A tibble: 5 × 3
  x1       x2    y
  <chr>    <chr> <chr>
1 one      two   two
2 four     three four
3 seven    nine  eight
4 six      eight seven
5 fourteen nine  twelve
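If you want to follow along, here is a minimal sketch of one way to build a data frame like d. The values are copied from the printout above; using tribble for it is my assumption, since the post does not say how d was made.

```r
library(tibble)

# Rebuild the example data frame, row by row, from the printed output
d <- tribble(
  ~x1,        ~x2,     ~y,
  "one",      "two",   "two",
  "four",     "three", "four",
  "seven",    "nine",  "eight",
  "six",      "eight", "seven",
  "fourteen", "nine",  "twelve"
)
d
```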
What you want to do is to change all the even numbers in columns x1 and x2, but not y, to the number versions of themselves, so that, for example, eight becomes 8. This would seem to be a job for str_replace_all, but how to manage the multitude of changes?
str_replace_all
I learned today that you can feed str_replace_all a named vector. Wossat, you say? Well, one of these:
quantile(1:7)
  0%  25%  50%  75% 100%
 1.0  2.5  4.0  5.5  7.0
The numbers here are the five-number summary; the things next to them, which say which percentile each one is, are the names attribute. You can make one of these yourself like this:
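A minimal sketch of a couple of ways to do that in base R. The vectors heights and v, and the values in them, are invented purely for illustration.

```r
# Give the names as you create the vector, as name = value pairs
heights <- c(alice = 160, bob = 175)
heights
names(heights)

# Or attach the names afterwards with names<-
v <- c(1, 2, 3)
names(v) <- c("a", "b", "c")
v
```

Either way you end up with values that each carry a label, which is all a named vector is.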
The value of this for us is that you can hand str_replace_all the whole boatload of potential changes at once, as a named vector of the changes it might make.
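Here is a small sketch of the idea on a toy example. The animals and swaps vectors are made up for illustration; the names of swaps are the patterns to look for, and the values are the replacements.

```r
library(stringr)

animals <- c("two cats", "four dogs", "six birds")

# Names are what to look for; values are what to put in their place
swaps <- c("two" = "2", "four" = "4")

replaced <- str_replace_all(animals, swaps)
replaced
```

Anything that matches none of the names (here, "six birds") is left alone.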
In our example, we wanted to replace the even numbers by the numeric versions of themselves, so let’s make a little data frame with all of those:
library(tidyverse) # for tribble, str_c, str_replace_all and mutate_at

changes <- tribble(
  ~from,      ~to,
  "two",      "2",
  "four",     "4",
  "six",      "6",
  "eight",    "8",
  "ten",      "10",
  "twelve",   "12",
  "fourteen", "14"
)
I think this is as high as we need to go. I like a tribble for this so that you can easily see what is going to replace what.
For the named vector, the values are the new values (the ones I called to in changes), while the names are the old ones (from). So let’s construct that. There is one extra thing: I want to replace whole words only (and not end up with something like 4teen, which sounds like one of those 90s boy bands), so what I’ll do is to put “word boundaries” [1] around the from values: [2]
my_changes <- changes$to
names(my_changes) <- str_c("\\b", changes$from, "\\b")
my_changes
     \\btwo\\b     \\bfour\\b      \\bsix\\b    \\beight\\b
           "2"            "4"            "6"            "8"
     \\bten\\b   \\btwelve\\b \\bfourteen\\b
          "10"           "12"           "14"
and that seems to reflect the changes we want to make. So let’s make it go, just on columns x1 and x2: [3]
d %>% mutate_at(
  vars(starts_with("x")),
  ~ str_replace_all(., my_changes)
)
# A tibble: 5 × 3
  x1    x2    y
  <chr> <chr> <chr>
1 one   2     two
2 4     three four
3 seven nine  eight
4 6     8     seven
5 14    nine  twelve
In words: “for each of the columns that starts with x, replace everything in it that’s in the recipe in my_changes.”
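The same recipe can also be written with across, which needs dplyr 1.0.0 or later. This is a sketch rather than the code the post was written with; d and my_changes are rebuilt here from the printouts above so that the block runs on its own.

```r
library(dplyr)
library(stringr)
library(tibble)

# The example data frame, rebuilt from the printed output
d <- tibble(
  x1 = c("one", "four", "seven", "six", "fourteen"),
  x2 = c("two", "three", "nine", "eight", "nine"),
  y  = c("two", "four", "eight", "seven", "twelve")
)

# Names are the patterns (with word boundaries); values are the replacements
my_changes <- c(
  "\\btwo\\b" = "2", "\\bfour\\b" = "4", "\\bsix\\b" = "6",
  "\\beight\\b" = "8", "\\bten\\b" = "10", "\\btwelve\\b" = "12",
  "\\bfourteen\\b" = "14"
)

# across() picks the columns; the lambda is applied to each one in turn
result <- d %>%
  mutate(across(starts_with("x"), ~ str_replace_all(.x, my_changes)))
result
```

The column y is not selected by starts_with("x"), so it comes through untouched.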
It seems to have worked, and not a 90s boy band in sight.
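To see what the word boundaries are buying us, here is a small sketch that runs the same replacement with and without them. The text vector is made up for illustration.

```r
library(stringr)

text <- c("four", "fourteen", "four score")

# Without word boundaries, "four" is also replaced inside "fourteen"
loose <- str_replace_all(text, c("four" = "4"))
loose

# With word boundaries, only the whole word "four" is replaced
strict <- str_replace_all(text, c("\\bfour\\b" = "4"))
strict
```

The first version turns "fourteen" into "4teen"; the second leaves it alone.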
[1] This Stack Overflow answer explains why the backslashes need to be doubled. The answer is for Python, but the same issue applies to R.
[2] This means that the number names only match when they are bounded on each side by a non-word character (such as a space or punctuation) or by the beginning or end of the text.
[3] The modern way to do this is to use across, but I wrote this post in 2019, and this is all we had then.
For attribution, please cite this work as
Butler (2019, May 12). Ken's Blog: Changing a lot of things in a lot of places. Retrieved from http://ritsokiguess.site/blogg/posts/2021-11-19-changing-a-lot-of-things-in-a-lot-of-places/
BibTeX citation
@misc{butler2019changing,
  author = {Butler, Ken},
  title = {Ken's Blog: Changing a lot of things in a lot of places},
  url = {http://ritsokiguess.site/blogg/posts/2021-11-19-changing-a-lot-of-things-in-a-lot-of-places/},
  year = {2019}
}