I am working with a list of previously clustered re-tweet usernames, which I would like to load into a document-term matrix so that I can compare the clusters with each other. Each cluster is stored as a separate document.
Some of the usernames taken from the original re-tweet data were not extracted cleanly, so they still contain fragments such as @userxxx, @user123, etc.
I would now like to clean up these remnants using tm_map with the removeWords function and specify that all words starting with an @ should be removed from my corpus.
The way I imagined it is as follows ("docs" is my previously established VCorpus):
#clean docs from remaining @retweets
docs <- VCorpus(DirSource(c))
docs <- tm_map(docs, removeWords, regex("@*"))
dtm <- DocumentTermMatrix(docs)
However, I am not sure whether it is possible to pass a regular expression to the removeWords function, and if so, how I would need to do it.
I would be very grateful for suggestions on how to deal with this. When I run the code, it does not produce any errors, but it also does not produce the expected results.
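For reference, here is a minimal sketch of one possible alternative, under the assumption that a plain regex substitution wrapped in content_transformer is acceptable in place of removeWords. The sample texts and the helper name remove_handles are hypothetical stand-ins, and the pattern "@\\S+" assumes that a handle runs up to the next whitespace character:

```r
library(tm)

# Hypothetical sample documents standing in for the cluster files
texts <- c("alice @user123 bob", "@userxxx carol dave")
docs <- VCorpus(VectorSource(texts))

# Hypothetical helper: delete every token that starts with "@"
# (assumes a handle runs up to the next whitespace character)
remove_handles <- content_transformer(function(x) gsub("@\\S+", "", x))

docs <- tm_map(docs, remove_handles)
# Collapse the double spaces left behind by the deletion
docs <- tm_map(docs, stripWhitespace)

dtm <- DocumentTermMatrix(docs)
```

With the real data, the VectorSource line would be replaced by the DirSource call from the snippet above. Whether this is preferable to a regex inside removeWords is exactly what I am unsure about.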
Thanks in advance!