I'm trying to do text mining on big data in R with tm.
I run into memory issues frequently (such as "cannot allocate vector of size ...") and use the established methods of troubleshooting those issues, such as:
- using 64-bit R
- trying different OS's (Windows, Linux, Solaris, etc)
- setting memory.limit() to its maximum (see the sketch after this list)
- making sure that sufficient RAM and compute is available on the server (which there is)
- making liberal use of gc()
- profiling the code for bottlenecks
- breaking up big operations into multiple smaller operations
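For reference, the memory housekeeping amounts to roughly this (memory.limit() is Windows-only, and the size shown is just an example):

# Confirm this is a 64-bit build of R (pointer size is 8 bytes)
.Machine$sizeof.pointer == 8

# On Windows, raise the memory ceiling (argument is in MB; value is an example)
memory.limit(size = 64000)

# Force garbage collection between heavy steps
gc()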
However, when trying to run Corpus on a vector of a million or so text fields, I encounter a slightly different memory error than usual and I'm not sure how to work around the problem. The error is:
> ds <- Corpus(DataframeSource(dfs))
Error: memory exhausted (limit reached?)
Can (and should) I run Corpus incrementally on blocks of rows from that source dataframe and then combine the results? Is there a more efficient way to run this?
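Roughly what I have in mind (just a sketch; the chunk size is arbitrary and I don't yet know how to stitch the pieces back together):

chunk_size <- 10000
row_groups <- split(seq_len(nrow(dfs)), ceiling(seq_len(nrow(dfs)) / chunk_size))

# Build one small corpus per block of rows instead of one giant Corpus() call
corpora <- lapply(row_groups, function(idx) {
  Corpus(DataframeSource(dfs[idx, , drop = FALSE]))
})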
The size of the data that will produce this error depends on the computer running it, but if you take the built-in crude dataset and replicate the documents until it's large enough, then you can replicate the error.
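For example, inflating crude along these lines should eventually trigger it (the replication factor is arbitrary, and newer tm versions may expect explicit doc_id/text columns for DataframeSource):

library(tm)
data("crude")   # 20 Reuters articles that ship with tm

# Flatten each sample document to a single string, then replicate until it's big
txt <- sapply(crude, function(d) paste(content(d), collapse = " "))
dfs <- data.frame(text = rep(txt, 50000), stringsAsFactors = FALSE)

ds <- Corpus(DataframeSource(dfs))   # eventually: Error: memory exhausted (limit reached?)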
UPDATE
I've been experimenting with trying to combine smaller corpora, i.e.
test1 <- dfs[1:10000,]
test2 <- dfs[10001:20000,]
ds.1 <- Corpus(DataframeSource(test1))
ds.2 <- Corpus(DataframeSource(test2))
and while I haven't been successful, I did discover tm_combine, which is supposed to solve this exact problem. The only catch is that for some reason, my 64-bit build of R 3.1.1 with the newest version of tm can't find the function tm_combine. Perhaps it was removed from the package for some reason? I'm investigating...
> require(tm)
> ds.12 <- tm_combine(ds.1,ds.2)
Error: could not find function "tm_combine"
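If tm_combine really has been dropped from recent tm releases, my understanding is that plain c() now dispatches on corpora and does the combining, so something like this may be the replacement (not yet verified on my build):

ds.12 <- c(ds.1, ds.2)   # c.VCorpus should concatenate the two corpora
length(ds.12)            # expect 20000 documents if both halves loaded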