8
votes

I'm trying to clean the corpus and I've used the typical steps, like the code below:

library(tm)
library(SnowballC)  # stemDocument() requires the SnowballC package to be installed

docs <- Corpus(DirSource(path))
docs <- tm_map(docs, content_transformer(tolower))
docs <- tm_map(docs, content_transformer(removeNumbers))
docs <- tm_map(docs, content_transformer(removePunctuation))
docs <- tm_map(docs, removeWords, stopwords('en'))
docs <- tm_map(docs, stripWhitespace)
docs <- tm_map(docs, stemDocument)
dtm  <- DocumentTermMatrix(docs)

Yet when I inspect the matrix, a few terms still come through wrapped in typographic quotes, such as: “we” “company” “code guidelines”, and others with a bullet in front: •known •accelerated

It seems the quotes are attached to the words themselves, but when I run the removePunctuation step again it doesn't remove them. There are also some words with bullets in front of them that I can't remove either.

Any help would be greatly appreciated.

Could you provide a reproducible example? - user3710546
I'm sorry, I don't quite understand 'reproducible example'? - anonymous
I used the code above for a document that contained the sentence : 'For purposes of this Agreement, “Separation from Service Date” shall mean the date of the Executive’s separation from service within the meaning of Section 409A(a)(2)(i)(A) of the Code and determined in accordance with the default rules under Section 409A of the Code'. Still doesn't clean properly. - anonymous
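
Turning that comment into a minimal reproducible sketch (the in-memory VCorpus/VectorSource setup is an assumption; the original read files from disk), the problem can be shown as:

library(tm)

txt  <- paste0("For purposes of this Agreement, \u201CSeparation from ",
               "Service Date\u201D shall mean the date of the Executive\u2019s ",
               "separation from service")
docs <- VCorpus(VectorSource(txt))
docs <- tm_map(docs, content_transformer(removePunctuation))
content(docs[[1]])
# The typographic quotes (\u201C, \u201D) and the apostrophe (\u2019)
# survive, because, as the accepted answer below explains,
# removePunctuation only targets the ASCII [[:punct:]] class.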

3 Answers

10
votes

removePunctuation uses gsub("[[:punct:]]", "", x), i.e. it removes only the ASCII symbols !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~. Typographic quotes, bullet signs, and other Unicode punctuation are left untouched. To remove such symbols (or any others), declare your own transformation function:

removeSpecialChars <- function(x) gsub("[“”•]", "", x)  # character class: curly quotes and the bullet
docs <- tm_map(docs, removeSpecialChars)
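
If the script file's encoding is uncertain, the same character class can be written with Unicode escapes (an equivalent variant, not from the original answer):

# \u201C = “, \u201D = ”, \u2022 = •
removeSpecialChars <- function(x) gsub("[\u201C\u201D\u2022]", "", x)
docs <- tm_map(docs, removeSpecialChars)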

Or you can go further and remove everything that is not an alphanumeric character or a space:

removeSpecialChars <- function(x) gsub("[^a-zA-Z0-9 ]","",x)
docs <- tm_map(docs, removeSpecialChars)
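
A quick check on a shortened, hypothetical version of the sentence from the comments:

x <- "\u2022 \u201CSeparation from Service Date\u201D shall mean..."
removeSpecialChars(x)
# [1] " Separation from Service Date shall mean"

Any leftover runs of spaces are then collapsed by the stripWhitespace step already in the pipeline.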
1
vote

A better-constructed tokenizer will handle this automatically. Try this:

> require(quanteda)
> text <- c("Enjoying \"my time\".", "Single 'air quotes'.")
> toktexts <- tokenize(toLower(text), removePunct = TRUE, removeNumbers = TRUE)
> toktexts
[[1]]
[1] "enjoying" "my"       "time"    

[[2]]
[1] "single" "air"    "quotes"

attr(,"class")
[1] "tokenizedTexts" "list"          
> dfm(toktexts, stem = TRUE, ignoredFeatures = stopwords("english"), verbose = FALSE)
Creating a dfm from a tokenizedTexts object ...
   ... indexing 2 documents
   ... shaping tokens into data.table, found 6 total tokens
   ... stemming the tokens (english)
   ... ignoring 174 feature types, discarding 1 total features (16.7%)
   ... summing tokens by document
   ... indexing 5 feature types
   ... building sparse matrix
   ... created a 2 x 5 sparse dfm
   ... complete. Elapsed time: 0.016 seconds.
Document-feature matrix of: 2 documents, 5 features.
2 x 5 sparse Matrix of class "dfmSparse"
       features
docs    air enjoy quot singl time
  text1   0     1    0     0    1
  text2   1     0    1     1    0
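
The quanteda API has changed substantially since this answer was written; under recent versions (assumption: quanteda >= 2.0, where tokenize()/toLower() were replaced by tokens()/tokens_tolower()), an equivalent pipeline would be:

library(quanteda)

text <- c("Enjoying \"my time\".", "Single 'air quotes'.")
toks <- tokens(text, remove_punct = TRUE, remove_numbers = TRUE)
toks <- tokens_tolower(toks)
toks <- tokens_remove(toks, stopwords("en"))          # drop stopwords
toks <- tokens_wordstem(toks, language = "english")   # stem
dfm(toks)                                             # document-feature matrix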
0
votes

The answer by @cyberj0g requires a small modification for the latest version of tm (0.6), which expects custom functions to be wrapped in content_transformer(). The updated code can be written as follows:

removeSpecialChars <- function(x) gsub("[^a-zA-Z0-9 ]", "", x)
corpus <- tm_map(corpus, content_transformer(removeSpecialChars))

Thank you @cyberj0g for the working code.
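
For completeness, a sketch of the asker's full pipeline with this fix slotted in (assumptions: tm >= 0.6, and path pointing at the document directory as in the question):

library(tm)
library(SnowballC)

removeSpecialChars <- function(x) gsub("[^a-zA-Z0-9 ]", "", x)

docs <- Corpus(DirSource(path))
docs <- tm_map(docs, content_transformer(tolower))
docs <- tm_map(docs, content_transformer(removeNumbers))
docs <- tm_map(docs, content_transformer(removeSpecialChars))  # replaces removePunctuation
docs <- tm_map(docs, removeWords, stopwords('en'))
docs <- tm_map(docs, stripWhitespace)
docs <- tm_map(docs, stemDocument)
dtm  <- DocumentTermMatrix(docs)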