How can I use the tm_map, removeWords, function with regex values?

Question

I am working with a list of previously clustered re-tweet usernames, which I would like to load into a Document-Term-Matrix for further comparison per cluster. Each cluster is stored as a separate document.

Some of the usernames taken from the original re-tweet data were not cleanly extracted, so they still contain remnants such as @userxxx, @user123, etc.

I would now like to clean up these remnants using tm_map with removeWords, specifying that all words starting with an @ should be removed from my corpus.

This is how I imagined it ("docs" is my previously established VCorpus):

#clean docs from remaining @retweets
docs <- VCorpus(DirSource(c))

docs <- tm_map(docs, removeWords, regex("@*"))

dtm <- DocumentTermMatrix(docs)

However, I am not sure whether it is possible to pass a regex to the removeWords function, and if so, how I would need to do it.

I would be very happy for suggestions on how to deal with this. If I run the code, it does not produce any errors but it also does not produce the expected results.

Thanks in advance!

hyde hyde · Accepted Answer · 2020-05-29T09:07:08

The solution I found was:

docs <- tm_map(docs, content_transformer(function(x) gsub(x, pattern = "@.*", replacement = "")))

It does what I needed.
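
One thing to watch with this pattern: removeWords takes a character vector of words, which is why a custom gsub wrapped in content_transformer is used here instead. The pattern "@.*" is greedy, though, so it deletes everything from the first @ to the end of each line. That is fine if each line holds a single username, but if other usernames can follow on the same line, a narrower pattern such as "@\\S+" may be closer to the intent. A minimal base R sketch of the difference, using a made-up sample vector x (not the original data):

```r
# Hypothetical sample lines; the real input would be the text of each document in `docs`.
x <- c("alice @user123 bob", "@userxxx carol", "dave")

# "@.*" is greedy: it deletes from the first @ to the end of each string.
greedy <- gsub("@.*", "", x)

# "@\\S+" deletes only the @ and the non-space characters attached to it.
token_only <- gsub("@\\S+", "", x)

# Collapse the whitespace left behind by the removed tokens.
token_only <- trimws(gsub("\\s+", " ", token_only))

print(greedy)
print(token_only)
```

If the narrower behaviour is what is wanted, the same pattern should drop into the tm call unchanged: docs <- tm_map(docs, content_transformer(function(x) gsub("@\\S+", "", x))).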
