Use tm's Corpus function with big data in R

Question

I'm trying to do text mining on big data in R with tm.

I frequently run into memory issues (such as "cannot allocate vector of size ...") and use the established methods of troubleshooting them, such as:

using 64-bit R
trying different OSes (Windows, Linux, Solaris, etc.)
setting memory.limit() to its maximum
making sure that sufficient RAM and compute is available on the server (which there is)
making liberal use of gc()
profiling the code for bottlenecks
breaking up big operations into multiple smaller operations

However, when trying to run Corpus on a vector of a million or so text fields, I encounter a slightly different memory error than usual, and I'm not sure how to work around the problem. The error is:

> ds <- Corpus(DataframeSource(dfs))
Error: memory exhausted (limit reached?)

Can (and should) I run Corpus incrementally on blocks of rows from that source dataframe then combine the results? Is there a more efficient way to run this?

The size of the data that will produce this error depends on the computer running it, but if you take the built-in crude dataset and replicate the documents until it's large enough, you can reproduce the error.
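
A minimal sketch of that reproduction, assuming a recent tm (0.7 or later), where DataframeSource expects columns named doc_id and text (older versions treated each row as a document). The replication factor n_copies is an arbitrary placeholder; raise it until the error appears on your machine:

```r
library(tm)

data("crude")  # 20 Reuters articles shipped with tm

# Flatten each document to a single string.
texts <- vapply(
  seq_along(crude),
  function(i) paste(content(crude[[i]]), collapse = " "),
  character(1)
)

# n_copies is a placeholder; increase it until memory runs out.
n_copies <- 50
big <- rep(texts, times = n_copies)

# Recent tm versions require these two column names.
dfs <- data.frame(
  doc_id = paste0("doc", seq_along(big)),
  text = big,
  stringsAsFactors = FALSE
)

ds <- Corpus(DataframeSource(dfs))
```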

UPDATE

I've been experimenting with combining smaller corpora, i.e.

test1 <- dfs[1:10000,]
test2 <- dfs[10001:20000,]

ds.1 <- Corpus(DataframeSource(test1))
ds.2 <- Corpus(DataframeSource(test2))

and while I haven't been successful, I did discover tm_combine which is supposed to solve this exact problem. The only catch is that for some reason, my 64-bit build of R 3.1.1 with the newest version of tm can't find the function tm_combine. Perhaps it was removed from the package for some reason? I'm investigating...

> require(tm)
> ds.12 <- tm_combine(ds.1,ds.2)
Error: could not find function "tm_combine"

I'm guessing that your code may be making too many copies...but I'll wait for the experts. — Rich Scriven
Fair enough. Right now it's just the source dataframe and the non-working Corpus command in the environment though. — Hack-R

Hack-R · Accepted Answer · 2014-08-27T19:18:19

I don't know whether tm_combine was deprecated or why it isn't found in the tm namespace, but I did find a solution: run Corpus on smaller chunks of the dataframe, then combine them.

This StackOverflow post had a simple way to do that without tm_combine:

test1 <- dfs[1:100000,]
test2 <- dfs[100001:200000,]

ds.1 <- Corpus(DataframeSource(test1))
ds.2 <- Corpus(DataframeSource(test2))

#ds.12 <- tm_combine(ds.1,ds.2) ##Error: could not find function "tm_combine"
ds.12 <- c(ds.1,ds.2)

which gives you:

ds.12

<<VCorpus (documents: 200000, metadata (corpus/indexed): 0/0)>>
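
The same idea can be extended to any number of chunks. The following is only a sketch: corpus_in_chunks and its chunk_size argument are hypothetical names, not part of tm. It assumes a recent tm where the dataframe has doc_id and text columns, and it calls VCorpus explicitly so that every piece has the same corpus class before c() combines them:

```r
library(tm)

# corpus_in_chunks is a hypothetical helper, not a tm function.
corpus_in_chunks <- function(df, chunk_size = 100000) {
  # Assign each row to a block of at most chunk_size rows.
  block <- ceiling(seq_len(nrow(df)) / chunk_size)

  # Build one corpus per block of rows.
  pieces <- lapply(split(df, block), function(part) {
    VCorpus(DataframeSource(part))
  })

  # c() concatenates corpora; unname() keeps the block labels
  # from leaking into the combined object's names.
  do.call(c, unname(pieces))
}

# Usage, assuming dfs is the source dataframe from the question:
# ds <- corpus_in_chunks(dfs, chunk_size = 100000)
```

Note that chunking only limits how much is processed in a single Corpus call; the combined corpus still has to fit in memory.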

Sorry not to figure this out on my own before asking. I tried and failed with other ways of combining objects.
