First of all, I know there are already quite a few threads about multiprocessing in Python, but none of them seems to solve my problem.
Here is my problem: I want to implement the Random Forest algorithm, and a naive way to do so would look like this:
    def random_tree(Data):
        tree = calculation(Data)
        forest.append(tree)

    forest = list()
    for i in range(300):
        random_tree(Data)
And the forest with 300 "trees" inside would be my final result. In this case, how do I turn this code into a multiprocessing version?
Update: I just tried Mukund M K's method, in a very simplified script:
    from multiprocessing import Pool
    import numpy as np

    def f(x):
        return 2 * x

    data = np.array([1, 2, 5])
    pool = Pool(processes=4)
    forest = pool.map(f, (data for i in range(4)))
    # I use range() instead of xrange() because I am using Python 3.4
And now the script runs forever. I opened a Python shell, entered the script line by line, and these are the messages I got:

    Process SpawnPoolWorker-1:
    Traceback (most recent call last):
      File "E:\Anaconda3\lib\multiprocessing\process.py", line 254, in _bootstrap
        self.run()
      File "E:\Anaconda3\lib\multiprocessing\process.py", line 93, in run
        self._target(*self._args, **self._kwargs)
      File "E:\Anaconda3\lib\multiprocessing\pool.py", line 108, in worker
        task = get()
      File "E:\Anaconda3\lib\multiprocessing\queues.py", line 357, in get
        return ForkingPickler.loads(res)
    AttributeError: Can't get attribute 'f' on

(the same traceback is printed by SpawnPoolWorker-2, SpawnPoolWorker-3, and SpawnPoolWorker-4)
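If I understand the traceback correctly, the spawned workers re-import the main module and cannot find f because I typed it directly into the interactive shell. One workaround that seems to work is to put f into its own importable file; a sketch, assuming a hypothetical module named worker.py:

    # worker.py (hypothetical file name)
    def f(x):
        return 2 * x

    # main script
    from multiprocessing import Pool
    import numpy as np
    from worker import f  # f now lives in an importable module

    if __name__ == '__main__':
        data = np.array([1, 2, 5])
        with Pool(processes=4) as pool:
            forest = pool.map(f, (data for i in range(4)))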
Update: I edited my sample code following some other example code, like this:

    from multiprocessing import Pool
    import numpy as np

    def f(x):
        return 2 * x

    if __name__ == '__main__':
        data = np.array([1, 2, 3])
        with Pool(5) as p:
            result = p.map(f, (data for i in range(300)))
And it works now. What I need to do next is fill this in with my more sophisticated algorithm.
Yet another question on my mind: why does this code work, while the previous version couldn't?
The difference is the if __name__ == '__main__': guard. If you read the multiprocessing Python docs, you will find that this is an explicit requirement for mp to work. - Ramon
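Update: to make sense of Ramon's comment, here is my understanding (a sketch, not authoritative): on Windows, multiprocessing starts workers with the "spawn" method, which re-imports the main module in every worker process. The guard keeps the pool-creating code from running again during that import:

    from multiprocessing import Pool

    def f(x):
        return 2 * x

    # Everything below runs only in the parent process. Without the guard,
    # each spawned worker would re-execute Pool(5) while importing this
    # module, which multiprocessing detects and aborts with a RuntimeError.
    if __name__ == '__main__':
        with Pool(5) as p:
            print(p.map(f, range(10)))

This also explains the earlier AttributeError: the workers re-import the main module to find f, and a function typed into an interactive shell has no module file they can import it from.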