0 votes

First of all, I know there are already quite a few threads about multiprocessing in Python, but none of them seems to solve my problem.

Here is my problem: I want to implement the random forest algorithm, and a naive way to do so would be like this:

def random_tree(Data):
    tree = calculation(Data)  # calculation() builds one tree from Data
    forest.append(tree)

forest = list()
for i in range(300):
    random_tree(Data)  # Data: a 2-D (100*3) NumPy array

And the forest with 300 "trees" inside would be my final result. In this case, how do I turn this code into a multiprocessing version?


Update: I just tried Mukund M K's method in a very simplified script:

from multiprocessing import Pool
import numpy as np

def f(x):
    return 2*x

data = np.array([1,2,5])

pool = Pool(processes=4)
forest = pool.map(f, (data for i in range(4))) 
# I use range() instead of xrange() because I am using Python 3.4

And now... the script runs forever. I opened a Python shell and entered the script line by line, and all four spawned workers printed the same traceback:

> Process SpawnPoolWorker-1:
> Process SpawnPoolWorker-2:
> Process SpawnPoolWorker-3:
> Process SpawnPoolWorker-4:
> Traceback (most recent call last):
>   File "E:\Anaconda3\lib\multiprocessing\process.py", line 254, in _bootstrap
>     self.run()
>   File "E:\Anaconda3\lib\multiprocessing\process.py", line 93, in run
>     self._target(*self._args, **self._kwargs)
>   File "E:\Anaconda3\lib\multiprocessing\pool.py", line 108, in worker
>     task = get()
>   File "E:\Anaconda3\lib\multiprocessing\queues.py", line 357, in get
>     return ForkingPickler.loads(res)
> AttributeError: Can't get attribute 'f' on <module '__main__' (built-in)>

Update: I edited my sample code, following some other example code:

from multiprocessing import Pool
import numpy as np

def f(x):
    return 2*x

if __name__ == '__main__':
    data = np.array([1,2,3])
    with Pool(5) as p:
        result = p.map(f, (data for i in range(300)))

And it works now! What I need to do next is fill this in with the more sophisticated algorithm.
Yet another question in my mind: why does this code work, while the previous version couldn't?

"Data" is a 2-D(100*3) numpy array. - Sidney
are you just reading it or modifying the contents as well in calculation? if so, does the order in which it is modified matter? - Mukund M K
I only read the data. In random forest algorithm, I would randomly sample from the original data ("Data")to build a tree. So every iteration is independent, that is why I think it should be able to parallelized. - Sidney
i know this is old but just in case. the cultprit here probably is the missing if __name__ == '__main__':. if you read the multiprocessing python docs you will find that this is an explicit requirement for mp to work. - Ramon
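
A minimal sketch of the guard pattern Ramon describes, assuming the spawn start method that Windows uses by default (f is the same toy function as in the question):

from multiprocessing import Pool

def f(x):
    return 2 * x

# Under spawn, each worker starts a fresh interpreter and imports the
# main module in order to find f. A function defined only in an
# interactive shell lives in a __main__ that cannot be imported, hence
# "Can't get attribute 'f'". The guard also stops the Pool creation
# from re-running in every worker during that import.
if __name__ == '__main__':
    with Pool(processes=4) as pool:
        print(pool.map(f, range(8)))  # [0, 2, 4, 6, 8, 10, 12, 14]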

2 Answers

1 vote

You can do it with multiprocessing this way:

from multiprocessing import Pool

def random_tree(Data):
    return calculation(Data)

if __name__ == '__main__':
    # The guard is required under the spawn start method (the Windows
    # default), as the question's update shows.
    with Pool(processes=4) as pool:
        forest = pool.map(random_tree, (Data for i in range(300)))
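
Fleshed out into a self-contained sketch: calculation and the 100*3 array below are stand-ins for the real tree-building code and Data, and the per-task seed is an added assumption so that each bootstrap sample differs:

from functools import partial
from multiprocessing import Pool
import numpy as np

def calculation(sample):
    # Stand-in for the real tree-building code.
    return sample.mean(axis=0)

def random_tree(data, seed):
    # Bootstrap sample: len(data) rows drawn with replacement, seeded
    # per task so the workers do not all build the same tree.
    rng = np.random.RandomState(seed)
    sample = data[rng.randint(0, len(data), size=len(data))]
    return calculation(sample)

if __name__ == '__main__':
    data = np.random.RandomState(0).rand(100, 3)  # stand-in for Data
    with Pool(processes=4) as pool:
        # partial binds the fixed array; pool.map varies only the seed.
        forest = pool.map(partial(random_tree, data), range(300))
    print(len(forest))  # 300 "trees"

pool.map passes exactly one argument to the callable, which is why functools.partial is used here: it binds the fixed data array while the mapped argument supplies the seed.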
0 votes

The processing package might help you. Check it out here.
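
For context: processing is the third-party package that was merged into the standard library as multiprocessing in Python 2.6, so on Python 3.4 the standard multiprocessing module already covers the same ground.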