1
votes

I have a column of titles in a table and would like to delete all words that are listed in a separate table/vector.

For example, table of titles:

"Lorem ipsum dolor"
"sit amet, consectetur adipiscing"
"elit, sed do eiusmod tempor"
"incididunt ut labore"
"et dolore magna aliqua."

To be deleted: c("Lorem", "dolore", "elit")

output:

"ipsum dolor"
"sit amet, consectetur adipiscing"
", sed do eiusmod tempor"
"incididunt ut labore"
"et magna aliqua."

The blacklisted words can occur multiple times.

The tm package has this functionality, but only in the context of building a word cloud. I need to keep the column intact rather than joining all the rows into one string of characters. Regex functions such as gsub() don't seem to work when given a set of values as a pattern. An Oracle SQL solution would also be interesting.

4
Thanks, but as written in the question, I wasn't able to use a set of values as a pattern for a regular expression. Am I missing something? - Tomek P
Combine gsub() with a loop. - Berecht

4 Answers

2
votes

First, set up the data:

dat <- c("Lorem ipsum dolor",
           "sit amet, consectetur adipiscing",
           "elit, sed do eiusmod tempor",
           "incididunt ut labore",
           "et dolore magna aliqua.")
todelete <- c("Lorem", "dolore", "elit")

We can avoid the loop by pasting the words into a single pattern. In a regular expression, | means "or", so collapsing the words with | yields one pattern that matches any of them:

gsub(paste0(todelete, collapse = "|"), "", dat)
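
One caveat: this pattern matches anywhere in the string, so "elit" would also be removed from inside a longer word such as "elite", and the deleted words leave stray spaces behind. A minimal sketch, assuming whole-word matching and whitespace tidying are wanted (the names pattern and cleaned are only illustrative):

```r
dat <- c("Lorem ipsum dolor",
         "sit amet, consectetur adipiscing",
         "elit, sed do eiusmod tempor",
         "incididunt ut labore",
         "et dolore magna aliqua.")
todelete <- c("Lorem", "dolore", "elit")

# Wrap the alternation in \\b word boundaries so only whole words match
pattern <- paste0("\\b(", paste(todelete, collapse = "|"), ")\\b")
cleaned <- gsub(pattern, "", dat, perl = TRUE)

# Collapse the runs of whitespace left behind, then trim both ends
cleaned <- trimws(gsub("\\s+", " ", cleaned))
cleaned
```

On the sample data this should reproduce the output requested in the question, including the leading comma in the third title.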

3
votes

lorem <- c("Lorem ipsum dolor",
           "sit amet, consectetur adipiscing",
           "elit, sed do eiusmod tempor",
           "incididunt ut labore",
           "et dolore magna aliqua.")

to.delete <- c("Lorem", "dolore", "elit")

output <- lorem
for (i in to.delete) {
  output <- gsub(i, "", output)
}

This gives:

[1] " ipsum dolor"                     "sit amet, consectetur adipiscing"
[3] ", sed do eiusmod tempor"          "incididunt ut labore"            
[5] "et  magna aliqua."
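
Note that gsub() treats each word as a regular expression, so a word containing characters such as ( or + would not be matched literally. A small sketch, assuming the words should be taken literally; the titles and words below are made up for illustration and are not from the question:

```r
# Hypothetical titles and words containing regex metacharacters
titles <- c("Price (USD) per unit", "C++ and R compared")
to.delete <- c("(USD)", "C++")

output <- titles
for (word in to.delete) {
  # fixed = TRUE makes gsub() match the word as a literal string
  output <- gsub(word, "", output, fixed = TRUE)
}
output
```

The trade-off is that fixed = TRUE rules out regex features such as word boundaries, so this variant removes the text wherever it occurs.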

2
votes

You could also use stri_replace_all_fixed:

library(stringi)
lorem <- c("Lorem ipsum dolor",
           "sit amet, consectetur adipiscing",
           "elit, sed do eiusmod tempor",
           "incididunt ut labore",
           "et dolore magna aliqua.")

to.delete <- c("Lorem", "dolore", "elit")

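# vectorize_all = FALSE applies every pattern to every string;
# the default (TRUE) pairs strings and patterns element by element
# and recycles the shorter vector with a warning
stri_replace_all_fixed(lorem, to.delete, '', vectorize_all = FALSE)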

Output:

[1] " ipsum dolor"                     "sit amet, consectetur adipiscing" ", sed do eiusmod tempor"         
[4] "incididunt ut labore"             "et  magna aliqua."               

2
votes

The tm package has a function for this: tm:::removeWords.character

It is implemented as follows:

foo <- function(x, words){
  gsub(sprintf("(*UCP)\\b(%s)\\b", paste(sort(words, decreasing = TRUE), 
                                         collapse = "|")), "", x, perl = TRUE)
}

With the words from the question, this boils down to the following call, where x is the vector of titles:

gsub("(*UCP)\\b(Lorem|elit|dolore)\\b","", x, perl = TRUE)
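
Since the question is about a column in a table, here is a hedged sketch of applying the same idea to a data frame column so that every row stays separate. The data frame titles, its column title, and the helper name remove_words are assumptions for illustration and are not part of tm:

```r
# Hypothetical helper built on the same pattern as shown above
remove_words <- function(x, words) {
  pattern <- sprintf("(*UCP)\\b(%s)\\b",
                     paste(sort(words, decreasing = TRUE), collapse = "|"))
  gsub(pattern, "", x, perl = TRUE)
}

# Assumed table layout: one title per row
titles <- data.frame(title = c("Lorem ipsum dolor",
                               "sit amet, consectetur adipiscing",
                               "elit, sed do eiusmod tempor",
                               "incididunt ut labore",
                               "et dolore magna aliqua."),
                     stringsAsFactors = FALSE)

# gsub() is vectorised, so the column is cleaned row by row in one call
titles$title <- remove_words(titles$title, c("Lorem", "dolore", "elit"))
titles
```

Because gsub() returns one result per input element, the number of rows is unchanged and no joining into a single string takes place.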