How to complete a stemmed corpus from a dictionary using stemCompletion function (tm package)

Question

I am having trouble with the tm package in R (version 0.6.2). The following question (two different errors) has already been answered here and here, but I still get an error after applying the posted solutions. Please click here to download the dataset (93 rows only), so the example is reproducible. The two errors are below:

(Resolved) Error in UseMethod("meta", x) : no applicable method for 'meta' applied to an object of class "character"
Error: inherits(doc, "TextDocument") is not TRUE
tm_map(ds.corpus, PlainTextDocument) does not create a plain text document in this case. inherits(ds.cleanCorpus, "TextDocument") # returns FALSE

Please tell me what is wrong with my approach.

--

    # Data import
    df.imp<- read.csv("Phone2_Sample100_NegPos.csv", header = TRUE, as.is = TRUE)

    ##### Data Pre-Processing

    install.packages("tm")
    require(tm)

    ds.corpus<- Corpus(VectorSource(df.imp$Content))

    ds.corpus<- tm_map(ds.corpus, content_transformer(tolower))
    ds.corpus<- tm_map(ds.corpus, content_transformer(removePunctuation))
    ds.corpus<- tm_map(ds.corpus, content_transformer(removeNumbers))
    removeURL<- function(x) gsub("http[[:alnum:]]*", "", x)
    ds.corpus<- tm_map(ds.corpus,removeURL)

    stopwords.default<- stopwords("english")
    stopWordsNotDeleted<- c("isn't" ,     "aren't" ,    "wasn't" ,    "weren't"   , "hasn't"    ,
                            "haven't" ,   "hadn't"  ,   "doesn't" ,   "don't"      ,"didn't"    ,
                            "won't"   ,   "wouldn't",   "shan't"  ,   "shouldn't",  "can't"     ,
                            "cannot"    , "couldn't"  , "mustn't", "but","no", "nor", "not", "too", "very")

    stopWord.new<- stopwords.default[! stopwords.default %in% stopWordsNotDeleted] ## new Stopwords list
    ds.corpus<- tm_map(ds.corpus, removeWords, stopWord.new )

    copy<- ds.corpus ## creating a copy to be used as a dictionary

    ds.corpus<- tm_map(ds.corpus, stemDocument)

    ## error Statement #1
    ds.corpus<-  stemCompletion(ds.corpus, dictionary = copy) 
    ## Error in UseMethod("meta", x) : no applicable method for 'meta' applied to an object of class "character"




    ds.cleanCorpus<- tm_map(ds.corpus, PlainTextDocument) ## creating plain text document

    class(ds.cleanCorpus) ## output is "VCorpus" "Corpus". What should it be?

    ## error Statement #2
    tdm<- TermDocumentMatrix(ds.corpus) ## creating  term document matrix 

    inherits(ds.cleanCorpus, "TextDocument") ## returns FALSE

Update: I figured out the first error: the x parameter of stemCompletion should be a character vector, while the dictionary can be either a corpus or a character vector. However, when I tried it on the first document (a character vector) of ds.corpus, as below, the stemmed words were not completed; the output is just the stemmed character vector, as before.

stemCompletion(ds.corpus[[1]]$content, dictionary = copy)

So now my main question is: "How do I complete a stemmed corpus from a dictionary (tm package)?" The stemCompletion method doesn't seem to work (on a character vector). Secondly, how can I complete the stemming of an entire corpus? Should I use a for loop over the content of each document in the corpus?

Hmm have a look at ?stemCompletion. In stemCompletion(ds.corpus,stemCompletion, dictionary = copy) you are passing an object of type Corpus to an argument that should be of type character, and... welll... I dunno where the 2nd argument stemCompletion should go. Maybe you should clarify what you are trying to accomplish...? — lukeA
I am just making myself comfortable with the functions of tm package. Here, I am performing basic data pre-processing before building out a model. Please check out the links in the question to get a better reference. @MrFlick — Jasmeet

amitkb3 · Accepted Answer · 2016-02-24T04:01:48

There are two things you need to change.

1. When you use a custom function with tm_map, you need to wrap it in content_transformer:

removeURL<- function(x) gsub("http[[:alnum:]]*", "", x)

ds.corpus<- tm_map(ds.corpus,content_transformer(removeURL))

2. The purpose of the function stemCompletion is to try to complete a stemmed word (https://en.wikipedia.org/wiki/Stemming) based on a dictionary. The stemmed words need to be a character vector; the dictionary can be a corpus.

x <- c("compan", "entit", "suppl")
stemCompletion(x, copy)

output:

     compan       entit       suppl
"companies"          ""  "supplies"

Code to create the term-document matrix:

# Data import
df.imp<- read.csv("data/Phone2_Sample100_NegPos.csv", header = TRUE, as.is = TRUE)

##### Data Pre-Processing 

#install.packages("tm")
require(tm)  

ds.corpus<- Corpus(VectorSource(df.imp$Content))

ds.corpus<- tm_map(ds.corpus, content_transformer(tolower))
ds.corpus<- tm_map(ds.corpus, content_transformer(removePunctuation))
ds.corpus<- tm_map(ds.corpus, content_transformer(removeNumbers))
removeURL<- function(x) gsub("http[[:alnum:]]*", "", x)
ds.corpus<- tm_map(ds.corpus,content_transformer(removeURL))


stopwords.default<- stopwords("english")
stopWordsNotDeleted<- c("isn't" ,     "aren't" ,    "wasn't" ,    "weren't"   , "hasn't"    ,
                        "haven't" ,   "hadn't"  ,   "doesn't" ,   "don't"      ,"didn't"    ,
                        "won't"   ,   "wouldn't",   "shan't"  ,   "shouldn't",  "can't"     ,
                        "cannot"    , "couldn't"  , "mustn't", "but","no", "nor", "not", "too", "very")

stopWord.new<- stopwords.default[! stopwords.default %in% stopWordsNotDeleted] ## new Stopwords list
ds.corpus<- tm_map(ds.corpus, removeWords, stopWord.new )

tdm<- TermDocumentMatrix(ds.corpus)

Example to complete stemmed words

copy<- ds.corpus ## creating a copy to be used as a dictionary
x <- c("compan", "entit", "suppl")
stemCompletion(x, copy)
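
The question also asks how to complete every document in a corpus. One likely reason the attempt on ds.corpus[[1]]$content returned the stems unchanged is that a document's content is a single string, whereas stemCompletion expects one stem per element. The following is only a sketch of one possible approach, not something taken from the tm documentation: split each document into words, complete them, and paste them back together through content_transformer. The toy documents, the hand-written stems, and the names complete_text, dict.corpus and dict.words are all made up for illustration; in the real pipeline the stems would come from stemDocument and the dictionary from copy.

```r
library(tm)

# Hypothetical toy data standing in for df.imp$Content (not the asker's dataset).
docs    <- c("companies supply phones", "company supplies batteries")
# Stems written by hand for the sketch; normally produced by stemDocument.
stemmed <- c("compani suppli phone", "compani suppli batteri")

# Unstemmed corpus, playing the role of `copy` in the code above.
dict.corpus <- VCorpus(VectorSource(docs))

# Extract the dictionary words once, so they are not recomputed per document.
dict.words <- unique(unlist(lapply(dict.corpus, function(d) {
  unlist(strsplit(content(d), "\\s+"))
})))

# Hypothetical helper: complete every stem in one document's text.
complete_text <- function(text, dictionary) {
  stems <- unlist(strsplit(text, "\\s+"))
  stems <- stems[nzchar(stems)]
  if (length(stems) == 0) return("")
  completed <- unname(stemCompletion(stems, dictionary = dictionary))
  # Keep the stem when no completion is found; a miss may be reported
  # as "" or NA, which is assumed here to depend on the tm version.
  missing <- is.na(completed) | !nzchar(completed)
  completed[missing] <- stems[missing]
  paste(completed, collapse = " ")
}

stemmed.corpus   <- VCorpus(VectorSource(stemmed))
completed.corpus <- tm_map(stemmed.corpus, content_transformer(complete_text),
                           dictionary = dict.words)

# Inspect the first completed document.
print(content(completed.corpus[[1]]))
```

Because the helper is wrapped in content_transformer, each element of the result should remain a text document, so no explicit for loop over the corpus is needed. Falling back to the stem is a design choice; dropping unmatched stems would be an equally valid alternative.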
