7
votes

Following the post about data.table and parallel computing, I'm trying to find a way to get an operation on a data.table parallized.

I have a data.table with 4 million rows of 14 observations and would like to share it in a common memory so that operations on it can be parallelized by using the "parallel"-package with parLapply without having to copy the table for each node in the cluster (what parLapply does). At the moment the costs for moving the data.table around are bigger than the benefit of parallel computation.

I found the "bigmemory"-package as an answer for sharing memory, but it doesn't maintain the "data.table"-structure of the data. So does anyone know a way to:

1) put the data.table in shared memory

2) maintain the "data.table"-structure of the data by doing so

3) use parallel processing on this data.table?

Thanks in advance!

2

2 Answers

1
votes

Old question, but here is an answer since nobody else has answered and it might be helpful. I assume the problem you are having is because you are on windows and having to use the PSOCK type of cluster. Unfortunately for windows this means you have to copy the data to each node. However, there is a work around. Get hold of docker and spin up an Rserve instance on the docker vm (e.g. stevenpollack/docker-rserve). Since this will be linux based you can create a FORK cluster on the docker vm. Then using your native R instance you can send over only once copy of the data to the Rserve instance (check out the RSclient library), do your parallelized job on the vm, and collect the results back into your native R.

0
votes

The "complete" solution, shared read and write access from multiple processes, and their problems is discussed here: https://github.com/Rdatatable/data.table/issues/3104

As rookie mentioned, if you fork an R process (with parallel::makeCluster(type = "FORK") or future::plan(multicore) (note that this does not work reliably in RStudio), the operating system will reuse memory pages that are not modified by the child process. So, your workers will share the same memory as long as they don't modify it (Copy-on-write). But this works only if you have all parallel workers on the same machine and fork() has its own problems (although this might be going too far if you simply want to conduct some parallel analysis).

Meanwhile, you could find the packages feather and fst interesting. feather provides a file format that can be read both by R and python and if I understood the docs correctly, feather::feather() gives you a file-backed read-only data-frame, albeit no data.table. This allows for moving data between those two languages.

fst employs the Zstandard compression algorithm to achieve very fast reading and writing speeds to disk. You can read in a part of a fst file using the fst() function (instead of read_fst()). So, every worker could just read the part of your table that it needs. Concurrent writing to the fst file is not possible. You would need to save every result in its own file and concatenate them afterwards.

Alternatively, for concurrent reading and writing, you could switch to a database, albeit that is slower than data.table. See SO/SQLite concurrent access