I am having trouble with the tm package in R (version 0.6.2). The two errors below have already been answered here and here, but I still get an error after applying the posted solutions. Please click here to download the dataset (93 rows only), so the example is reproducible. The errors are:

  1. (Resolved) Error in UseMethod("meta", x) : no applicable method for 'meta' applied to an object of class "character"

  2. Error: inherits(doc, "TextDocument") is not TRUE

  3. tm_map(ds.corpus, PlainTextDocument) does not create a plain text document in this case. inherits(ds.cleanCorpus, "TextDocument") # returns FALSE

Please tell me what is wrong with my approach.

--

    # Data import
    df.imp<- read.csv("Phone2_Sample100_NegPos.csv", header = TRUE, as.is = TRUE)

    ##### Data Pre-Processing

    install.packages("tm")
    require(tm)

    ds.corpus<- Corpus(VectorSource(df.imp$Content))

    ds.corpus<- tm_map(ds.corpus, content_transformer(tolower))
    ds.corpus<- tm_map(ds.corpus, content_transformer(removePunctuation))
    ds.corpus<- tm_map(ds.corpus, content_transformer(removeNumbers))
    removeURL<- function(x) gsub("http[[:alnum:]]*", "", x)
    ds.corpus<- tm_map(ds.corpus,removeURL)

    stopwords.default<- stopwords("english")
    stopWordsNotDeleted<- c("isn't" ,     "aren't" ,    "wasn't" ,    "weren't"   , "hasn't"    ,
                            "haven't" ,   "hadn't"  ,   "doesn't" ,   "don't"      ,"didn't"    ,
                            "won't"   ,   "wouldn't",   "shan't"  ,   "shouldn't",  "can't"     ,
                            "cannot"    , "couldn't"  , "mustn't", "but","no", "nor", "not", "too", "very")

    stopWord.new<- stopwords.default[! stopwords.default %in% stopWordsNotDeleted] ## new Stopwords list
    ds.corpus<- tm_map(ds.corpus, removeWords, stopWord.new )

    copy<- ds.corpus ## creating a copy to be used as a dictionary

    ds.corpus<- tm_map(ds.corpus, stemDocument)

    ## error Statement #1
    ds.corpus<-  stemCompletion(ds.corpus, dictionary = copy) 
    ## Error in UseMethod("meta", x) : no applicable method for 'meta' applied to an object of class "character"




    ds.cleanCorpus<- tm_map(ds.corpus, PlainTextDocument) ## creating plain text document

    class(ds.cleanCorpus) ## output is "VCorpus" "Corpus". What should it be?

    ## error Statement #2
    tdm<- TermDocumentMatrix(ds.corpus) ## creating  term document matrix 

    inherits(ds.cleanCorpus, "TextDocument") ## returns FALSE

Update: I figured out the first error. The x argument of stemCompletion must be a character vector, while the dictionary can be either a corpus or a character vector. However, when I tried it on the content of the first document of ds.corpus, as below, the stemmed words were not completed and the output was just the stemmed character vector, same as before.

    stemCompletion(ds.corpus[[1]]$content, dictionary = copy)

So now my main question is "How to complete a stemmed corpus from a dictionary (tm package)?" stemCompletion doesn't seem to work on a character vector. Secondly, how can I complete the stems of an entire corpus? Should I use a for loop over the content of each document in the corpus?

Hmm, have a look at ?stemCompletion. In stemCompletion(ds.corpus,stemCompletion, dictionary = copy) you are passing an object of type Corpus to an argument that should be of type character, and... well... I dunno where the 2nd argument stemCompletion should go. Maybe you should clarify what you are trying to accomplish...? - lukeA
I am just getting comfortable with the functions of the tm package. Here, I am performing basic data pre-processing before building a model. Please check the links in the question for a better reference. @MrFlick - Jasmeet

2 Answers

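There are 2 things you need to change:

  1. When you pass a custom function to tm_map, you need to wrap it in content_transformer. Otherwise the documents are replaced by plain character vectors, which is what triggers the error inherits(doc, "TextDocument") is not TRUE.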

    removeURL<- function(x) gsub("http[[:alnum:]]*", "", x)

    ds.corpus<- tm_map(ds.corpus,content_transformer(removeURL))

  2. The purpose of the function stemCompletion is to try to complete a stemmed word (see https://en.wikipedia.org/wiki/Stemming) based on a dictionary. The stemmed words need to be a character vector, and the dictionary can be a corpus.

    x <- c("compan", "entit", "suppl")
    stemCompletion(x, copy)

output:

         compan       entit       suppl 
    "companies"          ""  "supplies" 

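To see the first point in isolation, here is a minimal sketch on a made-up two-document corpus (the `texts` vector is an assumption standing in for `df.imp$Content`, and `VCorpus` is used explicitly so the behaviour does not depend on what `Corpus` returns in your tm version). It checks that each element is still a TextDocument after the wrapped transformation, which is what TermDocumentMatrix needs.

```r
library(tm)

# Hypothetical stand-in for df.imp$Content
texts <- c("Check http://example.com for the companies list",
           "Each company supplies phones")

corp <- VCorpus(VectorSource(texts))
corp <- tm_map(corp, content_transformer(tolower))
corp <- tm_map(corp, removePunctuation)

removeURL <- function(x) gsub("http[[:alnum:]]*", "", x)

# Wrapped in content_transformer: the function is applied to the content
# of each document, and the document object itself is kept
wrapped <- tm_map(corp, content_transformer(removeURL))

# Each element should still inherit from "TextDocument"
is_doc <- inherits(wrapped[[1]], "TextDocument")
print(is_doc)

# The corpus can therefore still be turned into a term-document matrix
tdm <- TermDocumentMatrix(wrapped)
print(dim(tdm))
```

If `is_doc` comes back FALSE in your own pipeline, look for a tm_map call that passes a plain function without content_transformer.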
Code to create the term-document matrix:

# Data import
df.imp<- read.csv("data/Phone2_Sample100_NegPos.csv", header = TRUE, as.is = TRUE)

##### Data Pre-Processing 

#install.packages("tm")
require(tm)  

ds.corpus<- Corpus(VectorSource(df.imp$Content))

ds.corpus<- tm_map(ds.corpus, content_transformer(tolower))
ds.corpus<- tm_map(ds.corpus, content_transformer(removePunctuation))
ds.corpus<- tm_map(ds.corpus, content_transformer(removeNumbers))
removeURL<- function(x) gsub("http[[:alnum:]]*", "", x)
ds.corpus<- tm_map(ds.corpus,content_transformer(removeURL))


stopwords.default<- stopwords("english")
stopWordsNotDeleted<- c("isn't" ,     "aren't" ,    "wasn't" ,    "weren't"   , "hasn't"    ,
                        "haven't" ,   "hadn't"  ,   "doesn't" ,   "don't"      ,"didn't"    ,
                        "won't"   ,   "wouldn't",   "shan't"  ,   "shouldn't",  "can't"     ,
                        "cannot"    , "couldn't"  , "mustn't", "but","no", "nor", "not", "too", "very")

stopWord.new<- stopwords.default[! stopwords.default %in% stopWordsNotDeleted] ## new Stopwords list
ds.corpus<- tm_map(ds.corpus, removeWords, stopWord.new )

tdm<- TermDocumentMatrix(ds.corpus)

Example to complete stemmed words

copy<- ds.corpus ## creating a copy to be used as a dictionary
x <- c("compan", "entit", "suppl")
stemCompletion(x, copy)
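
As the question's update notes, the dictionary does not have to be a corpus. A small sketch with a plain character vector as the dictionary (the words in `dict` are made up for illustration) also shows the `type` argument, which controls which candidate is picked when several dictionary words start with the same stem. I have not listed the output, since it depends on the heuristic chosen.

```r
library(tm)

# Hypothetical dictionary of full words (character vector instead of a corpus)
dict  <- c("company", "companies", "supply", "supplies")
stems <- c("compan", "suppl", "entit")

# The default type is "prevalent"; "shortest" and "longest" choose by word length
res_short <- stemCompletion(stems, dictionary = dict, type = "shortest")
res_long  <- stemCompletion(stems, dictionary = dict, type = "longest")

print(res_short)
print(res_long)
```

A stem with no match in the dictionary ("entit" here) is not completed, so it is worth checking for empty or NA entries before using the result.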
1
votes

Not sure if you have found a solution already. I was pointed in the right direction by the post "stemCompletion is not working", and I believe it answers your second question, "How to complete a stemmed corpus from a dictionary (tm package)?" (as well as mine, which is similar to yours). Specifically, you can try the following code:

    stem_completion <- tm_map(ds.corpus, 
                       content_transformer(function(x, d)
                         paste(stemCompletion(strsplit(stemDocument(x), ' ')[[1]], d), 
                               collapse = ' ')), d = copy)
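
For anyone who wants to try this without the dataset, below is a self-contained sketch of the same idea on a made-up corpus. The `texts` vector and the helper name `complete_doc` are my own assumptions, not part of tm, and stemDocument needs the SnowballC package installed. Unlike the snippet above, it stems the corpus first and falls back to the stem itself when no completion is found, so words are not dropped from the document.

```r
library(tm)

# Hypothetical documents standing in for the real data
texts <- c("the companies supply phones",
           "each company supplies entities")

corp <- VCorpus(VectorSource(texts))
dict <- corp                            # unstemmed copy used as the dictionary

stemmed <- tm_map(corp, stemDocument)   # needs SnowballC

# Hypothetical helper: complete every token of one document's content
complete_doc <- function(x, d) {
  tokens <- unlist(strsplit(x, "[[:space:]]+"))
  tokens <- tokens[nzchar(tokens)]
  if (!length(tokens)) return("")
  completed <- unname(stemCompletion(tokens, dictionary = d))
  # Keep the original stem where no completion was found
  missing <- is.na(completed) | completed == ""
  completed[missing] <- tokens[missing]
  paste(completed, collapse = " ")
}

completed <- tm_map(stemmed, content_transformer(complete_doc), d = dict)

print(content(completed[[1]]))
```

Because the helper is wrapped in content_transformer, the result is still a corpus of text documents and can be passed on to TermDocumentMatrix.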