Text Mining: Getting a Sentence-Term Matrix

Question

I'm currently running into trouble finding anything relevant to creating a sentence-term matrix in R using text mining.

I'm using the tm package and the only thing that I can find is converting to a tdm or dtm.

I'm using only one excel file where I'm only interested in text mining one column of. That one column has about 1200 rows within it. I want to create a row (sentence) - term matrix. I want to create a matrix that tells me the frequency of words in each row (sentence).

I want to create a matrix of 1's and 0's that I can run a PCA analysis on later.

A dtm in my case is not helpful because since I'm only using one file, the number of rows is 1 and the columns are the frequency of words in that whole document.

Instead, I want to treat the sentences as documents if that makes sense. From there, I want a matrix which the frequency of words in each sentence.

Thank you!

Manuel Bickel Manuel Bickel · Accepted Answer · 2017-10-23T19:38:39

When using text2vecyou just need to feed the content of your column as character vector into the tokenizer function - see below example.

Concerning your downstream analysis I would not recommend to run PCA on count data / integer values, PCA is not designed for this kind of data. You should either apply normalization, tfidf weighting, etc. on your dtm to turn it to continuous data before feeding it to PCA or otherwise apply correspondence analysis instead.

library(text2vex)

docs <- c("the coffee is warm",
          "the coffee is cold",
          "the coffee is hot",
          "the coffee is warm",
          "the coffee is hot",
          "the coffee is perfect")


#Generate document term matrix with text2vec    
tokens = docs %>%
  word_tokenizer()

it = itoken(tokens
            ,ids = paste0("sent_", 1:length(docs))
            ,progressbar = FALSE)

vocab = create_vocabulary(it)

vectorizer = vocab_vectorizer(vocab)

dtm = create_dtm(it, vectorizer, type = "dgTMatrix")

Text Mining: Getting a Sentence-Term Matrix

3 Answers