6
votes

I am a beginner at the R programming language and am currently working on a project. I have a huge Document Term Matrix (DTM) that I would like to convert into a data frame, but the functions I am using run out of memory, so I am not able to do so.

The method I have been using is to first convert the DTM into a matrix, and then convert that matrix into a data frame.

DF <- data.frame(as.matrix(DTM), stringsAsFactors=FALSE)

This worked perfectly with smaller DTMs. However, when the DTM is too large, I am not able to convert it to a matrix, and I get the error shown below:

Error: cannot allocate vector of size 2409.3 Gb

I have been looking online for a few days but have not been able to find a solution. I would be really thankful if anyone could suggest the best way to convert a DTM into a data frame, especially when dealing with a large DTM.

2
Probably not, the authors are different and the desired memory allocation here is very large. DTMs tend to be sparse, so it can be dangerous to try to naively convert them to (non-sparse) matrices. - beigel

2 Answers

8
votes

The tidytext package actually has a function to do just that. Try the tidy function, which returns a tibble (basically a fancy data frame that prints nicely). The nice thing about tidy is that it takes care of the pesky stringsAsFactors=FALSE issue by not converting strings to factors, and it deals nicely with the sparsity of your DTM.

as.matrix is trying to convert your DTM into a non-sparse matrix with an entry for every document and term, even if the term occurs 0 times in that document, which is causing your memory usage to balloon. tidy will convert it into a data frame where each document only has counts for the terms actually found in it.
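To get a feel for why the dense conversion blows up, here is a rough back-of-the-envelope sketch. The dimensions and the number of non-zero entries are made up for illustration (they are not taken from the question), and the byte counts assume a numeric matrix with 8-byte cells and a triplet layout with two 4-byte integer indices plus one 8-byte value per entry.

```r
# Hypothetical sizes, chosen only for illustration (not from the question).
n_docs  <- 100000
n_terms <- 50000
nonzero <- 5e6   # assumed number of non-zero document/term counts

# A dense numeric matrix stores 8 bytes per cell, zeros included.
dense_gib <- n_docs * n_terms * 8 / 1024^3

# A triplet (i, j, v) layout stores only the non-zero entries:
# two 4-byte integer indices plus one 8-byte double per entry.
sparse_gib <- nonzero * (4 + 4 + 8) / 1024^3

cat(sprintf("dense:  %.1f GiB\n", dense_gib))
cat(sprintf("sparse: %.3f GiB\n", sparse_gib))
```

The dense size grows with documents times terms, while the sparse size grows only with the number of counts that are actually present.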

In your example here you'd run

library(tidytext)
DF <- tidy(DTM)
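If it helps to see what a one-row-per-count layout looks like, here is a minimal base R sketch. It does not call tidytext; it assumes the DTM is stored in triplet form with fields i, j, v and dimnames (Docs, Terms), which is my understanding of how tm stores a DocumentTermMatrix. The toy object and the column names document, term and count are illustrative stand-ins.

```r
# Toy stand-in for a DTM in triplet form: entry k says that document i[k]
# contains term j[k] exactly v[k] times. Zero counts are simply absent.
toy <- list(
  i = c(1L, 1L, 2L, 3L),
  j = c(1L, 3L, 2L, 3L),
  v = c(2, 1, 4, 1),
  dimnames = list(Docs  = c("d1", "d2", "d3"),
                  Terms = c("apple", "banana", "cherry"))
)

# One row per non-zero entry, so memory grows with the number of
# non-zero counts rather than with n_docs * n_terms.
long_df <- data.frame(
  document = toy$dimnames$Docs[toy$i],
  term     = toy$dimnames$Terms[toy$j],
  count    = toy$v,
  stringsAsFactors = FALSE
)
print(long_df)
```

A 3 x 3 dense matrix would need 9 cells here; the long form needs only 4 rows because 5 of the document/term pairs never occur.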

There's even a vignette on how to use the tidytext package (meant to work in the tidyverse) here.

1
votes

It's possible that as.data.frame(as.matrix(DTM), stringsAsFactors=FALSE) instead of data.frame(as.matrix(DTM), stringsAsFactors=FALSE) might do the trick.

The API documentation notes that as.data.frame() simply coerces a matrix into a dataframe, whereas data.frame() creates a new data frame from the input.

as.data.frame(...) -> https://stat.ethz.ch/R-manual/R-devel/library/base/html/as.data.frame.html

data.frame(...) -> https://stat.ethz.ch/R-manual/R-devel/library/base/html/data.frame.html
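One visible difference between the two calls can be checked on a small matrix. This is only a sketch with a made-up 2 x 2 matrix; note that in both cases the dense matrix has to exist first, so this by itself does not reduce the memory needed by as.matrix.

```r
# Small dense matrix with term names that are not syntactic R names.
m <- matrix(1:4, nrow = 2,
            dimnames = list(c("d1", "d2"), c("new york", "e-mail")))

df_a <- as.data.frame(m)  # keeps the column names as they are
df_b <- data.frame(m)     # check.names = TRUE rewrites them to syntactic names

print(names(df_a))
print(names(df_b))
```

With real term names (spaces, hyphens, leading digits), as.data.frame may therefore be the safer choice if the original terms need to be preserved as column names.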