Purpose: I want to create a term-document matrix using a dictionary which has compound words, or bigrams, as some of the keywords.
Web Search: Being new to text mining and the tm package in R, I went to the web to figure out how to do this. Below are some relevant links that I found:
- FAQs on the tm-package website
- finding 2 & 3 word phrases using r tm package
- counter ngram with tm package in r
- findassocs for multiple terms in r
Background: Of these, I preferred the solution that uses NGramTokenizer from the RWeka package, but I ran into a problem. In the example code below, I create three documents and place them in a corpus. Note that Docs 1 and 2 each contain two words, while Doc 3 contains only one word. My dictionary keywords are two bigrams and a unigram.
Problem: The NGramTokenizer solution from the links above does not correctly count the unigram keyword in Doc 3.
library(tm)
library(RWeka)
my.docs <- c('jedi master', 'jedi grandmaster', 'jedi')
my.corpus <- Corpus(VectorSource(my.docs))
my.dict <- c('jedi master', 'jedi grandmaster', 'jedi')
BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 1, max = 2))
inspect(DocumentTermMatrix(my.corpus,
                           control = list(tokenize = BigramTokenizer,
                                          dictionary = my.dict)))
# <<DocumentTermMatrix (documents: 3, terms: 3)>>
# ...
# Docs jedi jedi grandmaster jedi master
# 1 1 0 1
# 2 1 1 0
# 3 0 0 0
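For reference, one diagnostic I can think of (a sketch, not a fix) is to call the tokenizer directly on each document's text, which should show which tokens DocumentTermMatrix has available to match against the dictionary; I have deliberately not asserted what the single-word call returns, since that is exactly the part in question.

```r
library(RWeka)
# Same tokenizer as in the example above: emit all unigrams and bigrams.
BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 1, max = 2))

BigramTokenizer('jedi master')  # two-word doc: its unigrams plus the bigram
BigramTokenizer('jedi')         # one-word doc: does this still yield "jedi"?
```

If the second call returns "jedi", then the tokenizer itself seems fine and the zero row for Doc 3 would point at how DocumentTermMatrix applies the tokenize/dictionary controls instead.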
I was expecting the row for Doc 3 to give 1 for jedi and 0 for the other two terms. Is there something I am misunderstanding?