Sharing a data.table in memory for parallel computing

Question

Following the post about data.table and parallel computing, I'm trying to find a way to parallelize an operation on a data.table.

I have a data.table with 4 million rows and 14 columns and would like to put it in shared memory so that operations on it can be parallelized using the "parallel" package with parLapply, without having to copy the table to each node in the cluster (which is what parLapply does). At the moment the cost of moving the data.table around outweighs the benefit of parallel computation.

I found the "bigmemory"-package as an answer for sharing memory, but it doesn't maintain the "data.table"-structure of the data. So does anyone know a way to:

1) put the data.table in shared memory

2) maintain the "data.table"-structure of the data by doing so

3) use parallel processing on this data.table?

Thanks in advance!

rookie rookie · Accepted Answer · 2017-08-10T10:55:44

Old question, but here is an answer since nobody else has answered and it might be helpful. I assume the problem you are having is because you are on Windows and have to use a PSOCK cluster. Unfortunately, on Windows this means the data has to be copied to each node. However, there is a workaround. Get hold of Docker and spin up an Rserve instance on the Docker VM (e.g. stevenpollack/docker-rserve). Since this will be Linux-based, you can create a FORK cluster on the Docker VM. Then, from your native R instance, you can send over just one copy of the data to the Rserve instance (check out the RSclient library), do your parallelized job on the VM, and collect the results back into your native R.
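To make the FORK part concrete, here is a minimal sketch of what the job on the Linux side could look like. It assumes the data.table is already loaded there as `dt`. The toy table, the column names `id` and `value`, and the helper `summarise_group` are made up for illustration and are not from the question. Shipping the data over with RSclient is not shown.

```r
library(parallel)
library(data.table)

# Toy stand-in for the real table (the table in the question has 4 million rows).
set.seed(1)
dt <- data.table(id = rep(1:8, each = 1000), value = rnorm(8000))

# Hypothetical per-group job: summarise the rows belonging to one id.
# It refers to `dt` as a global, so nothing large is passed as an argument.
summarise_group <- function(g) {
  dt[id == g, .(id = g, n = .N, mean_value = mean(value))]
}

# FORK clusters only exist on Unix-alikes (Linux, macOS), not on Windows.
stopifnot(.Platform$OS.type == "unix")

# Workers are forked after `dt` exists, so each one inherits it from the
# parent process instead of receiving a serialized copy.
cl <- makeCluster(2, type = "FORK")
res <- rbindlist(parLapply(cl, unique(dt$id), summarise_group))
stopCluster(cl)

print(res)
```

The important detail is that the cluster is created after `dt` exists and that only the small group ids are sent to the workers. Forked workers share the parent's memory pages copy-on-write, so as long as they only read from `dt` the table should not be duplicated in full, although the exact memory behaviour depends on the OS and on R's garbage collector.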
