0 votes

First of all, I know there are already quite a few threads about multiprocessing in Python, but none of them seems to solve my problem.

Here is my problem: I want to implement the random forest algorithm, and a naive way to do so would be like this:

def random_tree(Data):
    tree = calculation(Data)  # calculation() builds one tree from Data
    forest.append(tree)

forest = list()
for i in range(300):
    random_tree(Data)  # Data: a 2-D (100*3) NumPy array

And the forest with 300 "trees" inside would be my final result. In this case, how do I turn this code into a multiprocessing version?


Update: I just tried Mukund M K's method in a very simplified script:

from multiprocessing import Pool
import numpy as np

def f(x):
    return 2*x

data = np.array([1,2,5])

pool = Pool(processes=4)
forest = pool.map(f, (data for i in range(4))) 
# I use range() instead of xrange() because I am using Python 3.4

And now... the script runs forever. I opened a Python shell and entered the script line by line, and all four spawned workers printed the same traceback:

> Process SpawnPoolWorker-1:
> Process SpawnPoolWorker-2:
> Process SpawnPoolWorker-3:
> Process SpawnPoolWorker-4:
> Traceback (most recent call last):
>   File "E:\Anaconda3\lib\multiprocessing\process.py", line 254, in _bootstrap
>     self.run()
>   File "E:\Anaconda3\lib\multiprocessing\process.py", line 93, in run
>     self._target(*self._args, **self._kwargs)
>   File "E:\Anaconda3\lib\multiprocessing\pool.py", line 108, in worker
>     task = get()
>   File "E:\Anaconda3\lib\multiprocessing\queues.py", line 357, in get
>     return ForkingPickler.loads(res)
> AttributeError: Can't get attribute 'f' on <module '__main__' (built-in)>

Update: I edited my sample code, following some other example code:

from multiprocessing import Pool
import numpy as np

def f(x):
    return 2*x

if __name__ == '__main__':
    data = np.array([1,2,3])
    with Pool(5) as p:
        result = p.map(f, (data for i in range(300)))

And it works now! What I need to do next is fill this in with the more sophisticated algorithm.
Yet another question in my mind: why does this code work, while the previous version couldn't?

"Data" is a 2-D(100*3) numpy array. - Sidney
are you just reading it or modifying the contents as well in calculation? if so, does the order in which it is modified matter? - Mukund M K
I only read the data. In random forest algorithm, I would randomly sample from the original data ("Data")to build a tree. So every iteration is independent, that is why I think it should be able to parallelized. - Sidney
i know this is old but just in case. the cultprit here probably is the missing if __name__ == '__main__':. if you read the multiprocessing python docs you will find that this is an explicit requirement for mp to work. - Ramon
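
A minimal sketch of the guard pattern Ramon describes, assuming the spawn start method that Windows uses by default (f is the same toy function as in the question):

from multiprocessing import Pool

def f(x):
    return 2 * x

# Under spawn, each worker starts a fresh interpreter and imports the
# main module in order to find f. A function defined only in an
# interactive shell lives in a __main__ that cannot be imported, hence
# "Can't get attribute 'f'". The guard also stops the Pool creation
# from re-running in every worker during that import.
if __name__ == '__main__':
    with Pool(processes=4) as pool:
        print(pool.map(f, range(8)))  # [0, 2, 4, 6, 8, 10, 12, 14]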

2 Answers

1 vote

You can do it with multiprocessing this way:

from multiprocessing import Pool

def random_tree(Data):
    return calculation(Data)

if __name__ == '__main__':
    # The guard is required under the spawn start method (the Windows
    # default), as the question's update shows.
    with Pool(processes=4) as pool:
        forest = pool.map(random_tree, (Data for i in range(300)))
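
Fleshed out into a self-contained sketch: calculation and the 100*3 array below are stand-ins for the real tree-building code and Data, and the per-task seed is an added assumption so that each bootstrap sample differs:

from functools import partial
from multiprocessing import Pool
import numpy as np

def calculation(sample):
    # Stand-in for the real tree-building code.
    return sample.mean(axis=0)

def random_tree(data, seed):
    # Bootstrap sample: len(data) rows drawn with replacement, seeded
    # per task so the workers do not all build the same tree.
    rng = np.random.RandomState(seed)
    sample = data[rng.randint(0, len(data), size=len(data))]
    return calculation(sample)

if __name__ == '__main__':
    data = np.random.RandomState(0).rand(100, 3)  # stand-in for Data
    with Pool(processes=4) as pool:
        # partial binds the fixed array; pool.map varies only the seed.
        forest = pool.map(partial(random_tree, data), range(300))
    print(len(forest))  # 300 "trees"

pool.map passes exactly one argument to the callable, which is why functools.partial is used here: it binds the fixed data array while the mapped argument supplies the seed.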
0 votes

The processing package might help you. Check it out here.
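
For context: processing is the third-party package that was merged into the standard library as multiprocessing in Python 2.6, so on Python 3.4 the standard multiprocessing module already covers the same ground.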