3
votes

I have a term doc matrix (16,977 terms, 29,414 documents):

Non-/sparse entries: 355000/499006478
Sparsity           : 100%
Maximal term length: 7 
Weighting          : term frequency (tf)

For further analysis I need to restrict the number of terms to 2,425. How can I generate a new term-document matrix that keeps only terms with a total frequency above some threshold, for instance 20?

Since the matrix is large, the usual approach of converting it with as.matrix cannot be applied.


2 Answers

5
votes

Something like this might work. Treat the DTM as a simple triplet matrix and compute the column totals with a function from the slam package. That saves you from having to convert it to a dense matrix.

library(slam)
library(tm)
data(crude)
dtm1 <- DocumentTermMatrix(crude)


# Find the total occurrences of each word across all docs
colTotals <- col_sums(dtm1)

# Keep only words that occur more than 20 times across all docs
dtm2 <- dtm1[, which(colTotals > 20)]

> dtm1
A document-term matrix (20 documents, 1266 terms)

Non-/sparse entries: 2255/23065
Sparsity           : 91%
Maximal term length: 17 
Weighting          : term frequency (tf)

> dtm2
A document-term matrix (20 documents, 12 terms)

Non-/sparse entries: 174/66
Sparsity           : 28%
Maximal term length: 6 
Weighting          : term frequency (tf)
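
The question asks about a term-document matrix, where terms are rows rather than columns. A minimal sketch of the same idea for that orientation, assuming slam's row_sums and the crude sample data used above (the names tdm1, tdm2 and rowTotals are only illustrative):

```r
library(slam)
library(tm)
data(crude)

# In a term-document matrix the terms are rows, so sum across rows
tdm1 <- TermDocumentMatrix(crude)
rowTotals <- row_sums(tdm1)

# Keep only terms that occur more than 20 times across all docs
tdm2 <- tdm1[which(rowTotals > 20), ]

# Same number of documents, fewer terms
dim(tdm1)
dim(tdm2)
```

The matrix stays sparse throughout, so this should also be feasible on a matrix of the size described in the question.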

Does that work on your data and answer your question?

0
votes

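I think this is possible with a control list when building the matrix. Note that bounds$global filters on the number of documents a term appears in, not on its total count across the corpus: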

library(tm)
dtm <- DocumentTermMatrix(your.corpus, control = list(
  bounds = list(global = c(20, Inf))
))
inspect(dtm)
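
To see what the global bounds actually filter on, here is a small sketch on the crude sample data. The threshold of 5 and the names dtm_all, dtm_df and docFreq are assumptions chosen for illustration. It relies on the matrix being stored as a simple triplet matrix, so the column indices in $j mark the non-zero entries.

```r
library(tm)
data(crude)

# Unfiltered matrix, used here only for comparison
dtm_all <- DocumentTermMatrix(crude)

# Keep terms that appear in at least 5 of the 20 documents
# (5 is an arbitrary threshold chosen for this small corpus)
dtm_df <- DocumentTermMatrix(crude, control = list(
  bounds = list(global = c(5, Inf))
))

# Document frequency of each term: number of non-zero entries in its column
docFreq <- tabulate(dtm_all$j, nbins = ncol(dtm_all))
names(docFreq) <- Terms(dtm_all)

# Smallest document frequency among the retained terms
min(docFreq[Terms(dtm_df)])
```

If the threshold is meant to apply to total occurrences rather than to the number of documents, the col_sums approach from the other answer is the one to use.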