I am wondering how starting a Pool of workers to run a task in parallel differs from starting individual processes, specifically with respect to pickling and distributing the jobs.
I have a task (here do_my_job) whose objects cannot be pickled, so I cannot start a pool of workers to execute it in parallel. The following snippet does NOT work, where iterator yields the different parameter settings for do_my_job:
import multiprocessing as multip
mpool = multip.Pool(ncores)
mpool.map(do_my_job, iterator)
mpool.close()
mpool.join()
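For illustration, here is a minimal sketch of the kind of object I mean; the UnpicklableTask class and the lock it carries are just hypothetical stand-ins for my real objects, but they trigger the same kind of pickling failure with Pool.map:

import multiprocessing as multip
import threading

class UnpicklableTask:
    """Hypothetical stand-in: carries a resource that pickle refuses to serialize."""
    def __init__(self):
        self.lock = threading.Lock()  # thread locks cannot be pickled
    def __call__(self, params):
        with self.lock:
            print("processed", params)

if __name__ == "__main__":
    do_my_job = UnpicklableTask()
    mpool = multip.Pool(2)
    # Fails with a pickling error: Pool.map has to serialize the callable
    # and each argument in order to send them to the worker processes.
    mpool.map(do_my_job, range(4))
    mpool.close()
    mpool.join()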
Yet, the following code snippet DOES work:
import time
import multiprocessing as multip

keep_running = True
process_list = []
while len(process_list) > 0 or keep_running:
    # Collect the indices of processes that have finished.
    terminated_procs = []
    for idx, proc in enumerate(process_list):
        if not proc.is_alive():
            terminated_procs.append(idx)
    # Pop from the back so the remaining indices stay valid.
    for terminated_proc in reversed(terminated_procs):
        process_list.pop(terminated_proc)
    # Start a new process as long as a core is free and parameters remain.
    if len(process_list) < ncores and keep_running:
        try:
            task = next(iterator)
            proc = multip.Process(target=do_my_job,
                                  args=(task,))
            proc.start()
            process_list.append(proc)
        except StopIteration:
            keep_running = False
    time.sleep(0.1)
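For reference, here is the second approach packaged as a self-contained helper, so the two variants are easy to compare side by side; run_with_processes and example_job are just names I made up for this sketch:

import time
import multiprocessing as multip

def run_with_processes(job, params_iter, ncores):
    """Keep at most ncores processes alive, feeding them parameters from params_iter."""
    process_list = []
    keep_running = True
    while process_list or keep_running:
        # Forget processes that have already finished.
        process_list = [proc for proc in process_list if proc.is_alive()]
        if len(process_list) < ncores and keep_running:
            try:
                task = next(params_iter)
            except StopIteration:
                keep_running = False
            else:
                proc = multip.Process(target=job, args=(task,))
                proc.start()
                process_list.append(proc)
        time.sleep(0.1)

def example_job(params):
    """Trivial stand-in for do_my_job."""
    print("processed", params)

if __name__ == "__main__":
    run_with_processes(example_job, iter(range(8)), ncores=2)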
How is my job distributed to the individual processes in the latter case? Is there no pickling of the task and all related objects before a process is started? If not, how are the task and its objects passed to the new processes?