0 votes

I'm having trouble finding anything relevant to creating a sentence-term matrix in R for text mining.

I'm using the tm package, and the only thing I can find is converting to a term-document matrix (tdm) or document-term matrix (dtm).

I'm working from a single Excel file, and I'm only interested in text mining one column of it. That column has about 1,200 rows. I want to create a row (sentence)-term matrix, that is, a matrix that tells me the frequency of words in each row (sentence).

I want to create a matrix of 1s and 0s that I can run a PCA on later.

A dtm in my case is not helpful because, since I'm only using one file, it has just one row and the columns give the frequency of words across the whole document.

Instead, I want to treat the sentences as documents, if that makes sense. From there, I want a matrix giving the frequency of words in each sentence.
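
A rough sketch of what I mean, using the tm package (df and text_col are placeholder names for my data frame and column; each row becomes its own document):

library(tm)

# hypothetical: df$text_col holds the ~1200 sentences, one per row
corp  <- VCorpus(VectorSource(df$text_col))  # each row -> one document
dtm   <- DocumentTermMatrix(corp)            # rows = sentences, columns = terms
mat01 <- (as.matrix(dtm) > 0) * 1            # the 1/0 matrix I'm after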

Thank you!


3 Answers

1 vote

When using text2vec, you just need to feed the content of your column as a character vector into the tokenizer function; see the example below.

Concerning your downstream analysis, I would not recommend running PCA on count data / integer values; PCA is not designed for this kind of data. You should either apply normalization, tf-idf weighting, etc. to your dtm to turn it into continuous data before feeding it to PCA (see the sketch after the code below), or apply correspondence analysis instead.

library(text2vec)
library(magrittr)  # for the %>% pipe

docs <- c("the coffee is warm",
          "the coffee is cold",
          "the coffee is hot",
          "the coffee is warm",
          "the coffee is hot",
          "the coffee is perfect")


# Generate a document-term matrix with text2vec
tokens = docs %>%
  word_tokenizer()

it = itoken(tokens
            ,ids = paste0("sent_", 1:length(docs))
            ,progressbar = FALSE)

vocab = create_vocabulary(it)

vectorizer = vocab_vectorizer(vocab)

dtm = create_dtm(it, vectorizer, type = "dgTMatrix")
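
To follow the weighting advice above, here is a minimal sketch of tf-idf weighting with text2vec's TfIdf model, applied to the dtm created above, before handing it to PCA (the prcomp call is just an illustration, not a tuned analysis):

# re-weight raw counts with tf-idf so PCA sees continuous values
tfidf     = TfIdf$new()
dtm_tfidf = fit_transform(dtm, tfidf)

# PCA on the densified, weighted matrix
pca = prcomp(as.matrix(dtm_tfidf))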

0 votes

With the corpus library:

library(corpus)
library(Matrix)

corpus <- federalist # sample data
x <- term_matrix(text_split(corpus, "sentences"))

In your case, though, it sounds like you have already split the text into sentences. If that is true, there is no need for the text_split call; just do

x <- term_matrix(data$your_column_with_sentences)

(replacing data$your_column_with_sentences with whatever is appropriate for your data).
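
If you then want the 1/0 matrix mentioned in the question, one possibility (a sketch; x is the sparse term matrix from above, and the Matrix package is already loaded):

# turn counts into 0/1 presence indicators (stays sparse)
x01 <- 1 * (x > 0)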

-1 votes

I can't add comments, so here's a suggestion:

library(data.table)  # for fread
library(stringr)     # for str_count
# Read data from file using fread (for .csv; add parameters as needed: col.names, nrows, etc.)
dat <- fread(filename)
# count "the" in the selected column (a placeholder name) for rows row_start:row_end
counts <- sapply(row_start:row_end, function(z) str_count(dat[[selected_col_name]][z], "the"))

This will give you all occurrences of "the" in the column of interest for the selected rows. You could also use apply if it's for all rows, or other nested functions for different variations. Bear in mind that you would need to check for lowercase/uppercase letters; you can use tolower for that (see the sketch below). Hope this is helpful!
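
A small sketch of the all-rows, case-insensitive variant (same placeholder column name as above):

# vectorised over the whole column; tolower handles case
counts_all <- str_count(tolower(dat[[selected_col_name]]), "the")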