8
votes

I'm trying to clean the corpus and I've used the typical steps, like the code below:

library(tm)
library(SnowballC)  # stemDocument() requires the SnowballC package to be installed

docs <- Corpus(DirSource(path))
docs <- tm_map(docs, content_transformer(tolower))
docs <- tm_map(docs, content_transformer(removeNumbers))
docs <- tm_map(docs, content_transformer(removePunctuation))
docs <- tm_map(docs, removeWords, stopwords('en'))
docs <- tm_map(docs, stripWhitespace)
docs <- tm_map(docs, stemDocument)
dtm  <- DocumentTermMatrix(docs)

Yet when I inspect the matrix, a few terms still come through wrapped in typographic quotes, such as: “we” “company” “code guidelines”, and others with a bullet in front: •known •accelerated

It seems the quotes are attached to the words themselves, but when I run the removePunctuation step again it doesn't remove them. There are also some words with bullets in front of them that I can't remove either.

Any help would be greatly appreciated.

Could you provide a reproducible example? - user3710546
I'm sorry, I don't quite understand 'reproducible example'? - anonymous
I used the code above for a document that contained the sentence : 'For purposes of this Agreement, “Separation from Service Date” shall mean the date of the Executive’s separation from service within the meaning of Section 409A(a)(2)(i)(A) of the Code and determined in accordance with the default rules under Section 409A of the Code'. Still doesn't clean properly. - anonymous
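
Turning that comment into a minimal reproducible sketch (the in-memory VCorpus/VectorSource setup is an assumption; the original read files from disk), the problem can be shown as:

library(tm)

txt  <- paste0("For purposes of this Agreement, \u201CSeparation from ",
               "Service Date\u201D shall mean the date of the Executive\u2019s ",
               "separation from service")
docs <- VCorpus(VectorSource(txt))
docs <- tm_map(docs, content_transformer(removePunctuation))
content(docs[[1]])
# The typographic quotes (\u201C, \u201D) and the apostrophe (\u2019)
# survive, because, as the accepted answer below explains,
# removePunctuation only targets the ASCII [[:punct:]] class.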

3 Answers

10
votes

removePunctuation uses gsub("[[:punct:]]", "", x), i.e. it removes only the ASCII symbols !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~. Typographic quotes, bullet signs, and other Unicode punctuation are left untouched. To remove such symbols (or any others), declare your own transformation function:

removeSpecialChars <- function(x) gsub("[“”•]", "", x)  # character class: curly quotes and the bullet
docs <- tm_map(docs, removeSpecialChars)
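
If the script file's encoding is uncertain, the same character class can be written with Unicode escapes (an equivalent variant, not from the original answer):

# \u201C = “, \u201D = ”, \u2022 = •
removeSpecialChars <- function(x) gsub("[\u201C\u201D\u2022]", "", x)
docs <- tm_map(docs, removeSpecialChars)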

Or you can go further and remove everything that is not an alphanumeric character or a space:

removeSpecialChars <- function(x) gsub("[^a-zA-Z0-9 ]","",x)
docs <- tm_map(docs, removeSpecialChars)
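
A quick check on a shortened, hypothetical version of the sentence from the comments:

x <- "\u2022 \u201CSeparation from Service Date\u201D shall mean..."
removeSpecialChars(x)
# [1] " Separation from Service Date shall mean"

Any leftover runs of spaces are then collapsed by the stripWhitespace step already in the pipeline.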
1
vote

A better-constructed tokenizer will handle this automatically. Try this:

> require(quanteda)
> text <- c("Enjoying \"my time\".", "Single 'air quotes'.")
> toktexts <- tokenize(toLower(text), removePunct = TRUE, removeNumbers = TRUE)
> toktexts
[[1]]
[1] "enjoying" "my"       "time"    

[[2]]
[1] "single" "air"    "quotes"

attr(,"class")
[1] "tokenizedTexts" "list"          
> dfm(toktexts, stem = TRUE, ignoredFeatures = stopwords("english"), verbose = FALSE)
Creating a dfm from a tokenizedTexts object ...
   ... indexing 2 documents
   ... shaping tokens into data.table, found 6 total tokens
   ... stemming the tokens (english)
   ... ignoring 174 feature types, discarding 1 total features (16.7%)
   ... summing tokens by document
   ... indexing 5 feature types
   ... building sparse matrix
   ... created a 2 x 5 sparse dfm
   ... complete. Elapsed time: 0.016 seconds.
Document-feature matrix of: 2 documents, 5 features.
2 x 5 sparse Matrix of class "dfmSparse"
       features
docs    air enjoy quot singl time
  text1   0     1    0     0    1
  text2   1     0    1     1    0
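
The quanteda API has changed substantially since this answer was written; under recent versions (assumption: quanteda >= 2.0, where tokenize()/toLower() were replaced by tokens()/tokens_tolower()), an equivalent pipeline would be:

library(quanteda)

text <- c("Enjoying \"my time\".", "Single 'air quotes'.")
toks <- tokens(text, remove_punct = TRUE, remove_numbers = TRUE)
toks <- tokens_tolower(toks)
toks <- tokens_remove(toks, stopwords("en"))          # drop stopwords
toks <- tokens_wordstem(toks, language = "english")   # stem
dfm(toks)                                             # document-feature matrix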
0
votes

The answer by @cyberj0g requires a small modification for the latest version of tm (0.6), which expects custom functions to be wrapped in content_transformer(). The updated code can be written as follows:

removeSpecialChars <- function(x) gsub("[^a-zA-Z0-9 ]", "", x)
corpus <- tm_map(corpus, content_transformer(removeSpecialChars))

Thank you @cyberj0g for the working code.
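
For completeness, a sketch of the asker's full pipeline with this fix slotted in (assumptions: tm >= 0.6, and path pointing at the document directory as in the question):

library(tm)
library(SnowballC)

removeSpecialChars <- function(x) gsub("[^a-zA-Z0-9 ]", "", x)

docs <- Corpus(DirSource(path))
docs <- tm_map(docs, content_transformer(tolower))
docs <- tm_map(docs, content_transformer(removeNumbers))
docs <- tm_map(docs, content_transformer(removeSpecialChars))  # replaces removePunctuation
docs <- tm_map(docs, removeWords, stopwords('en'))
docs <- tm_map(docs, stripWhitespace)
docs <- tm_map(docs, stemDocument)
dtm  <- DocumentTermMatrix(docs)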