I've got the following code, which I expect to give me a list of 3, since there are 3 elements in texts:

library(stringr)
texts <- c("I doubt it! :)", ";) disagree, but ok.", "No emoticons here!!!")
smileys <- c(":)","(:",";)",":D")
str_extract_all(texts, fixed(smileys))

Instead, I get a list of four (the length of my "pattern" argument, here the smileys). Additionally, I get the following warning message:

Warning message: In stri_extract_all_fixed(string, pattern, simplify = simplify, : longer object length is not a multiple of shorter object length

Well, I wouldn't expect the lengths to match, since I'm looking for hits on any of the smileys in each text. I don't want to match string 1 with pattern 1, string 2 with pattern 2, etc.

Aware that I am messing up stringi's understanding of vectorizing, I have tried this instead:

library(purrr)
texts %>% map(~ str_extract_all(.x, fixed(smileys)))

This is much better, as it gives me a list of 3, but each element is in turn a list of four.

What I'm trying to get to is a list of 3 that is as little nested as possible. Someone, somewhere, has solved this, but I can't for the life of me figure it out or get how to google it. I could do a for loop over this, but I consider myself a citizen of the tidyverse...

Grateful for any assistance.

Comments:

Not familiar with stringr, but I believe you may want to have a look at "grep using a character vector with multiple patterns". If you pursue the "paste collapse = |" method, then you might need to consider "How do I deal with special characters like \^$.?*|+()[{ in my regex?" - Henrik

Not sure if this is what you're looking for, but you can try something like this: pattern <- paste("\\Q", smileys, "\\E", sep = "", collapse = "|"); stringi::stri_extract_all_regex(texts, pattern) - Jota

Yeah, the issue with just pasting things together with the pipe is that I'd have to escape all the parentheses, colons, etc., which make up a lot of the smileys! - Joy

Guilty as charged, @Jota! I didn't try it before commenting. Your solution works like a charm! Feel free to post it as an answer and I'll mark it as correct. - Joy

@Joy The \Q...\E method is described in the second link I provided. - Henrik

1 Answer

You can use paste to wrap each element of smileys in \\Q and \\E and collapse them with the regex "or" metacharacter (|) to form a single pattern. As mentioned in the link Henrik shared and documented in ?regex and in the stringi manual, characters between \\Q and \\E are interpreted literally.

pattern <- paste("\\Q", smileys, "\\E", sep = "", collapse = "|")
# [1] "\\Q:)\\E|\\Q(:\\E|\\Q;)\\E|\\Q:D\\E"

library(stringi)
stri_extract_all_regex(texts, pattern)
#[[1]]
#[1] ":)"
#
#[[2]]
#[1] ";)"
#
#[[3]]
#[1] NA

Base R:

regmatches(texts, gregexpr(pattern, texts))
#[[1]]
#[1] ":)"
#
#[[2]]
#[1] ";)"
#
#[[3]]
#character(0)  
       # If you want an NA instead of a zero-length vector,
       # then you could do something like this (if/else rather than
       # ifelse(), which would keep only the first match):
       # lapply(
       #   regmatches(texts, gregexpr(pattern, texts)),
       #   function(ii) if (length(ii) == 0L) NA_character_ else ii)

And if you do want to use purrr and avoid regular expressions, one idea would be something like this:

library(purrr)
library(stringr)
texts %>% 
  map(~ unlist(str_extract_all(.x, fixed(smileys))))
#[[1]]
#[1] ":)"
#
#[[2]]
#[1] ";)"
#
#[[3]]
#character(0)  
       # if you want NA, not a zero-length vector, you could add
       # (if/else rather than ifelse(), which would keep only the first match):
       # %>% map(~ if (length(.x) == 0L) NA_character_ else .x)
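
To show how the quoting approach behaves when a text contains more than one smiley, here is a minimal base R sketch. The helper name extract_literals is hypothetical (it is not part of stringr or stringi), the second text is altered from the question so that it contains two smileys, and perl = TRUE is my own choice so that \Q...\E is handled by the PCRE engine.

```r
# Hypothetical helper: extract every literal occurrence of any element of
# `literals` from each element of `x`. Returns a list as long as `x`.
extract_literals <- function(x, literals) {
  # Wrap each literal in \Q...\E so characters such as ")" are not treated
  # as regex metacharacters, then join the alternatives with "|".
  pattern <- paste0("\\Q", literals, "\\E", collapse = "|")
  hits <- regmatches(x, gregexpr(pattern, x, perl = TRUE))
  # Use NA for texts without a match; keep all matches otherwise.
  lapply(hits, function(m) if (length(m) == 0L) NA_character_ else m)
}

# Assumption: the second text is modified to contain two smileys.
texts <- c("I doubt it! :)", ";) disagree, but ok. :D", "No emoticons here!!!")
smileys <- c(":)", "(:", ";)", ":D")

result <- extract_literals(texts, smileys)
str(result)
```

Matches come back in the order they appear in each text, and the result is a flat list with one character vector per text.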