1
votes

Hi, this must be super basic:

I am using the tm package to create a document-term matrix from a corpus, so the column names of my matrix are the indices of the terms in my corpus. Could anyone tell me how to inspect the original words in my corpus that correspond to these indices in the matrix? Thank you so much!

2
Try inspect() on the document-term matrix; otherwise, create a reproducible example of exactly what you want to see. - phiver

2 Answers

2
votes

In a TermDocumentMatrix the terms are the row names (in a DocumentTermMatrix they are the column names), so you can read the words straight off the matrix. Is this what you wanted?

library(tm)
docs <- c("This is a text.", 
          "This another one.", 
          "This is some more.")

cor  <- Corpus(VectorSource(docs))
tdm  <- TermDocumentMatrix(cor, control=list(tolower=TRUE, removePunctuation=TRUE))
as.matrix(tdm)
#          Docs
# Terms     1 2 3
#   another 0 1 0
#   more    0 0 1
#   one     0 1 0
#   some    0 0 1
#   text    1 0 0
#   this    1 1 1
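
Since the question mentions a document-term matrix rather than a term-document matrix, here is a minimal sketch of the same lookup in that orientation. It assumes the same three example documents; the names `corp`, `dtm`, and `terms` are only illustrative. `Terms()` and `inspect()` are tm functions, and `Terms(dtm)` should give the same vector as `colnames(dtm)`.

```r
library(tm)

docs <- c("This is a text.",
          "This another one.",
          "This is some more.")

corp <- Corpus(VectorSource(docs))
dtm  <- DocumentTermMatrix(corp, control = list(tolower = TRUE, removePunctuation = TRUE))

# In a DocumentTermMatrix the terms label the columns
terms <- Terms(dtm)

# Look up the words behind particular column indices
terms[c(1, 3)]

# Print a summary of the matrix, including a sample of its entries
inspect(dtm)
```

If the matrix is small, `as.matrix(dtm)` shows every document and term at once, as in the example above.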

In the future, please be sure to include a representative example of your data.

0
votes

I suspect you are trying to map the terms back to the original corpus and get the indices of the documents that contain them. If so, you can do this by building on @jlhoward's example:

library(tm)
docs <- c("This is a text. And me too.", 
          "This another one.", 
          "This is some more.")

cor  <- Corpus(VectorSource(docs))
tdm  <- TermDocumentMatrix(cor, control=list(tolower=TRUE, removePunctuation=TRUE))
as.matrix(tdm)

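# Pull the raw text of each document out of the corpus
txt <- sapply(cor, function(x) x[[1]])

# For each term, find the documents whose text matches it.
# grep() treats the term as a regular expression and matches substrings.
setNames(lapply(rownames(tdm), function(x){
   grep(x, txt, ignore.case=TRUE)
}), rownames(tdm))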

## $and
## [1] 1
## 
## $another
## [1] 2
## 
## $more
## [1] 3
## 
## $one
## [1] 2
## 
## $some
## [1] 3
## 
## $text
## [1] 1
## 
## $this
## [1] 1 2 3
## 
## $too
## [1] 1
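
Because grep() matches substrings, a term such as "one" would also hit a document that contains "someone". A possible alternative, sketched here under the assumption that the matrix is small enough to convert with as.matrix(), is to read the document indices from the term-document matrix itself. The names `m` and `doc_idx` are only illustrative.

```r
library(tm)

docs <- c("This is a text. And me too.",
          "This another one.",
          "This is some more.")

cor <- Corpus(VectorSource(docs))
tdm <- TermDocumentMatrix(cor, control = list(tolower = TRUE, removePunctuation = TRUE))

# Dense copy of the counts: rows are terms, columns are documents
m <- as.matrix(tdm)

# For each term, keep the positions of the documents with a non-zero count
doc_idx <- lapply(setNames(rownames(m), rownames(m)), function(term) {
  unname(which(m[term, ] > 0))
})

doc_idx[["this"]]
```

This relies on the counts tm already computed, so each term is matched as a whole token rather than as a pattern inside the raw text.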