I've been playing around with the R interface to the Redis database, as well as the doRedis parallel backend for foreach. I have a few questions that I hope will help me apply this tool better:
- doMC, doSMP, doSNOW, etc. all seem to work by spawning worker processes on the same computer, passing each worker an element from a list and a function to apply, and then gathering the results. In the case of doMC, the workers share memory. However, I'm a little confused as to how a database can provide this same functionality.
- When I add an additional slave computer to the doRedis job queue (as in this video), is the entire Redis database being sent to the slave computer? Or is each slave just sent the data it needs at a particular moment (i.e. one element of a list and a function to apply)?
- How do I explicitly pass additional data and functions to the doRedis job queue that each slave will need to perform its computations? (See the sketch after this list for the pattern I think I'm after.)
- When using doRedis and foreach, are there any additional 'gotchas' that might not apply to other parallel backends?
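To make the third question concrete, here is the kind of pattern I think I'm after. This is only a minimal sketch of my current understanding, assuming a Redis server running on localhost; the queue name "jobs" and the bigModel object are placeholders I made up, while registerDoRedis(), startLocalWorkers(), and foreach's .export/.packages arguments are the documented pieces I'm asking about:

```r
library(doRedis)

# Register the backend: tasks get pushed onto a Redis queue, and any
# worker listening on that queue pops them off one at a time.
registerDoRedis("jobs", host = "localhost")

# Start two workers on this machine; on a slave computer I would
# presumably run redisWorker("jobs", host = "<master-ip>") instead.
startLocalWorkers(n = 2, queue = "jobs")

# An object every task needs (placeholder for real shared data).
bigModel <- lm(mpg ~ wt, data = mtcars)

# .export ships named objects from the master's environment to each
# worker; .packages loads the named packages on each worker.
results <- foreach(i = 1:10, .combine = c,
                   .export = "bigModel",
                   .packages = "stats") %dopar% {
  predict(bigModel, newdata = data.frame(wt = i / 2))
}

removeQueue("jobs")
```

Is .export the right mechanism here, or does doRedis handle this some other way?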
I know this is a lot of questions, but I've been running into situations where my limited understanding of how parallel processing works has been hindering my ability to implement it. For example, I recently tried to parallelize a computation on a large database and caught myself passing the entire database to each node on my cluster, an operation that completely destroyed any advantage I'd gained from parallelizing. A sketch of what I mean is below.
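Concretely, here is roughly the anti-pattern I fell into, and the chunked alternative I believe I should have used. The data frame and column names are illustrative; isplit() comes from the iterators package, and I'm assuming a parallel backend (e.g. doRedis) has already been registered:

```r
library(foreach)
library(iterators)

# Illustrative data: 1000 observations in 4 groups.
df <- data.frame(g = rep(letters[1:4], each = 250),
                 x = rnorm(1000))

# Anti-pattern: referencing df inside the loop body ships the whole
# data frame to every worker, even though each task uses one group.
# means <- foreach(grp = letters[1:4], .combine = c) %dopar%
#   mean(df$x[df$g == grp])

# Better: isplit() hands each task only the rows for its own group,
# so only one chunk travels to each worker per task.
means <- foreach(d = isplit(df$x, df$g), .combine = c) %dopar% {
  setNames(mean(d$value), d$key[[1]])
}
```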
Thank you!