14
votes

I need to split on words and end marks (punctuation of certain types). Oddly, a pipe ("|") can count as an end mark. I have code that works on end marks until I try to add the pipe. Adding the pipe makes strsplit split on every character. Escaping it causes an error. How can I include the pipe in the regular expression?

x <- "I like the dog|."

strsplit(x, "[[:space:]]|(?=[.!?*-])", perl=TRUE)
#[[1]]
#[1] "I"    "like" "the"  "dog|" "."   

strsplit(x, "[[:space:]]|(?=[.!?*-\|])", perl=TRUE)
#Error: '\|' is an unrecognized escape in character string starting "[[:space:]]|(?=[.!?*-\|"

The outcome I'd like:

#[[1]]
#[1] "I"    "like" "the"  "dog"  "|"  "."  #pipe is an element
2
I am always hesitant to put regex tags on R regex questions because you get regexers from other languages and though the answers are similar they don't overlap. - Tyler Rinker

2 Answers

19
votes

One way to solve this is to use the \Q...\E notation to remove the special meaning of any of the characters in .... As it says in ?regex:

If you want to remove the special meaning from a sequence of characters, you can do so by putting them between ‘\Q’ and ‘\E’. This is different from Perl in that ‘$’ and ‘@’ are handled as literals in ‘\Q...\E’ sequences in PCRE, whereas in Perl, ‘$’ and ‘@’ cause variable interpolation.

For example:

> strsplit(x, "[[:space:]]|(?=[\\Q.!?*-|\\E])", perl=TRUE)
[[1]]
[1] "I"    "like" "the"  "dog"  "|"    "."
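
If the set of end marks changes often, the same idea can be wrapped in a small function. This is only a sketch: split_on_marks and its marks argument are hypothetical names, not part of base R, and it assumes every mark is a single character and that none of them contains the sequence \E.

```r
# Hypothetical helper (not from the original answer): build the split
# pattern from a vector of single-character end marks.
split_on_marks <- function(text, marks = c(".", "!", "?", "*", "-", "|")) {
  # \Q...\E makes every character in between literal, including "-" and "|"
  char_class <- paste0("[\\Q", paste(marks, collapse = ""), "\\E]")
  # Split on whitespace, or at the position just before any end mark
  pattern <- paste0("[[:space:]]|(?=", char_class, ")")
  strsplit(text, pattern, perl = TRUE)[[1]]
}

split_on_marks("I like the dog|.")
# [1] "I"    "like" "the"  "dog"  "|"    "."
```

With the default marks, the generated pattern is the same one used in the call above, so the order of the marks inside the brackets does not matter.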
12
votes

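The problem is actually your hyphen, not the pipe. Inside a character class, *-| is read as a range running from * to |, which covers the digits and letters, so the lookahead matches before almost every character. The hyphen should come either first or last so that it is taken literally: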

strsplit(x, "[[:space:]]|(?=[|.!?*-])", perl=TRUE)
strsplit(x, "[[:space:]]|(?=[.|!?*-])", perl=TRUE)
strsplit(x, "[[:space:]]|(?=[.!|?*-])", perl=TRUE)
strsplit(x, "[[:space:]]|(?=[-|.!?*])", perl=TRUE)

and so on should all give you the output you are looking for.

You can also escape the hyphen if you prefer, but remember to use two backslashes!

strsplit(x, "[[:space:]]|(?=[.!?*\\-|])", perl=TRUE)
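
To see why the position of the hyphen matters, it may help to test the two character classes directly with grepl. This is a minimal sketch; the variable names range_class and literal_class are made up for illustration.

```r
# Code points of the two characters on either side of the hyphen
utf8ToInt("*")  # 42
utf8ToInt("|")  # 124

range_class   <- "[.!?*-|]"  # hyphen between "*" and "|": a range 42..124
literal_class <- "[.!?*|-]"  # hyphen last: a literal "-"

probe <- c("a", "Z", "7", "-", "|")

# The range swallows letters and digits as well as the punctuation
in_range <- grepl(range_class, probe, perl = TRUE)
in_range
# [1] TRUE TRUE TRUE TRUE TRUE

# With the hyphen last, only the listed marks match
in_literal <- grepl(literal_class, probe, perl = TRUE)
in_literal
# [1] FALSE FALSE FALSE  TRUE  TRUE
```

Because the lookahead in the original pattern used the range version, it matched before nearly every character of x, which is why strsplit appeared to split everything apart.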