0
votes

I'm having problems with a Django application that uses a random forest classifier (http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html) to classify items. The error I'm receiving says:

AttributeError at /items/

'Thread' object has no attribute '_children'

Request Method:     POST
Request URL:    http://localhost:8000/items/
Django Version:     1.7.6
Exception Type:     AttributeError
Exception Value:    

'Thread' object has no attribute '_children'

Exception Location:     /usr/lib/python2.7/multiprocessing/dummy/__init__.py in start, line 73
Python Executable:  /home/cristian/env/bin/python
Python Version:     2.7.3
Python Path:    

['/home/cristian/filters',
 '/home/cristian/env/lib/python2.7',
 '/home/cristian/env/lib/python2.7/plat-linux2',
 '/home/cristian/env/lib/python2.7/lib-tk',
 '/home/cristian/env/lib/python2.7/lib-old',
 '/home/cristian/env/lib/python2.7/lib-dynload',
 '/usr/lib/python2.7',
 '/usr/lib/python2.7/plat-linux2',
 '/usr/lib/python2.7/lib-tk',
 '/home/cristian/env/local/lib/python2.7/site-packages']

Server time:    Fri, 24 Apr 2015 16:08:20 +0000

The problem is that I'm not using threads at all. This is the code:

# Imports assumed from the names used below.
import os

import pandas as pd
from rest_framework import status
from rest_framework.decorators import api_view
from rest_framework.response import Response
from sklearn.externals import joblib  # joblib as bundled with scikit-learn

# CLASSIFIERS_PATH, ItemSerializer, ClassificationSerializer and Classification
# are defined elsewhere in the project.

def item_to_dict(item):
    # Wrap every value in a list so the dict maps to a one-row DataFrame.
    item_dict = {}
    for key in item:
        value = item[key]
        # fix encoding
        if isinstance(value, unicode):
            value = value.encode('utf-8')
        item_dict[key] = [value]
    return item_dict

def load_classifier(filter_name):
    clf = joblib.load(os.path.join(CLASSIFIERS_PATH, filter_name, 'random_forest.100k.' + filter_name.lower() + '.pkl'))
    return clf

@api_view(['POST'])
def classify_item(request):
    """
    Classify item
    """
    if request.method == 'POST':
        serializer = ItemSerializer(data=request.data['item'])
        if serializer.is_valid():
            # get item and filter_name
            item = serializer.data
            filter_name = request.data['filter']

            item_dict = item_to_dict(item)

            clf = load_classifier(filter_name)

            # score item
            y_pred = clf.predict_proba(pd.DataFrame(item_dict))
            item_score = y_pred[0][1]

            # create and save classification
            classification = Classification(classifier_name=filter_name, score=item_score, item_id=item['_id'])
            classification_serializer = ClassificationSerializer(classification)
            return Response(classification_serializer.data, status=status.HTTP_201_CREATED)
        else:
            return Response(serializer.errors, status=status.HTTP_400_BAD_REQUEST)

I'm able to print the "clf" and "item_dict" variables and everything seems OK. The error is raised when I call the classifier's "predict_proba" method. One important detail: I don't receive the error the first time I send the POST request after starting the server.

Here's the full traceback:

File "/home/cristian/env/local/lib/python2.7/site-packages/django/core/handlers/base.py" in get_response
line 111.                     response = wrapped_callback(request, *callback_args, **callback_kwargs)
File "/home/cristian/env/local/lib/python2.7/site-packages/django/views/decorators/csrf.py" in wrapped_view
line 57.         return view_func(*args, **kwargs)
File "/home/cristian/env/local/lib/python2.7/site-packages/django/views/generic/base.py" in view
line 69.             return self.dispatch(request, *args, **kwargs)
File "/home/cristian/env/local/lib/python2.7/site-packages/rest_framework/views.py" in dispatch
line 452.             response = self.handle_exception(exc)
File "/home/cristian/env/local/lib/python2.7/site-packages/rest_framework/views.py" in dispatch
line 449.             response = handler(request, *args, **kwargs)
File "/home/cristian/env/local/lib/python2.7/site-packages/rest_framework/decorators.py" in handler
line 50.             return func(*args, **kwargs)
File "/home/cristian/filters/classifiers/views.py" in classify_item
line 70.             y_pred = clf.predict_proba(pd.DataFrame(item_dict))
File "/home/cristian/env/local/lib/python2.7/site-packages/sklearn/pipeline.py" in predict_proba
line 159.         return self.steps[-1][-1].predict_proba(Xt)
File "/home/cristian/env/local/lib/python2.7/site-packages/sklearn/ensemble/forest.py" in predict_proba
line 468.             for i in range(n_jobs))
File "/home/cristian/env/local/lib/python2.7/site-packages/sklearn/externals/joblib/parallel.py" in __call__
line 568.             self._pool = ThreadPool(n_jobs)
File "/usr/lib/python2.7/multiprocessing/pool.py" in __init__
line 685.         Pool.__init__(self, processes, initializer, initargs)
File "/usr/lib/python2.7/multiprocessing/pool.py" in __init__
line 136.         self._repopulate_pool()
File "/usr/lib/python2.7/multiprocessing/pool.py" in _repopulate_pool
line 199.             w.start()
File "/usr/lib/python2.7/multiprocessing/dummy/__init__.py" in start
line 73.         self._parent._children[self] = None

Exception Type: AttributeError at /items/
Exception Value: 'Thread' object has no attribute '_children'
Please include the full traceback so that we can understand the root cause of your problem. (comment by ogrisel)

2 Answers

0
votes

As a workaround, you can disable threading at prediction time with:

clf = load_classifier(filter_name)
clf.set_params(n_jobs=1)
y_pred = clf.predict_proba(pd.DataFrame(item_dict))
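
Note that the traceback in the question passes through sklearn/pipeline.py, which suggests that clf is a Pipeline wrapping the forest rather than the forest itself. If that is the case, n_jobs is a parameter of a nested step, and a plain set_params(n_jobs=1) on the pipeline may be rejected as an invalid parameter. Below is a minimal sketch of a helper that covers both cases. It relies only on scikit-learn's get_params/set_params convention for nested names (step__param); the name force_single_job is made up for this example.

```python
def force_single_job(estimator):
    """Set n_jobs=1 on the estimator and on any nested step that exposes n_jobs.

    Hypothetical helper: works with any object that follows scikit-learn's
    get_params(deep=True) / set_params(**params) convention.
    """
    params = estimator.get_params(deep=True)
    # Nested parameters are reported as '<step>__<param>', e.g. 'rf__n_jobs'.
    overrides = dict(
        (name, 1) for name in params
        if name == 'n_jobs' or name.endswith('__n_jobs')
    )
    if overrides:
        estimator.set_params(**overrides)
    return estimator
```

In the view this would be used as clf = force_single_job(load_classifier(filter_name)) before calling predict_proba.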

Also note that calling load_classifier on every request might be expensive, because it actually loads the model from disk each time.

You can pass mmap_mode='r' to joblib.load to memory-map the data from disk. That way the model data only needs to be loaded into memory once, even when concurrent requests access the same model parameters (from different threads, or from different Python processes if you use something like gunicorn).
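
One way to combine both suggestions is to keep the loaded models in a module-level cache, so that each worker process reads a given pickle at most once. This is only a sketch: get_classifier, the loader argument and the CLASSIFIERS_PATH value are assumptions for illustration, and the file naming is copied from the question. On an old scikit-learn the import would be from sklearn.externals import joblib.

```python
import os
import threading

CLASSIFIERS_PATH = 'classifiers'  # placeholder; use the project's real setting

_cache = {}
_cache_lock = threading.Lock()


def _load_from_disk(filter_name):
    # Imported lazily so this module can be imported without joblib installed.
    import joblib
    path = os.path.join(
        CLASSIFIERS_PATH, filter_name,
        'random_forest.100k.' + filter_name.lower() + '.pkl')
    # mmap_mode='r' memory-maps the stored arrays read-only.
    return joblib.load(path, mmap_mode='r')


def get_classifier(filter_name, loader=_load_from_disk):
    """Return the cached classifier for filter_name, loading it on first use."""
    with _cache_lock:
        if filter_name not in _cache:
            _cache[filter_name] = loader(filter_name)
        return _cache[filter_name]
```

The view would then call get_classifier(filter_name) instead of load_classifier(filter_name). The lock only guards the cache dictionary; whether a shared model is safe to use from several threads at once is a separate question.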

0
votes

It looks like this problem has been fixed as of Python 2.7.5. It was basically a bug in multiprocessing.
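
If upgrading is not an option: the last traceback frame shows multiprocessing.dummy assuming that the current thread has a _children attribute. As far as I can tell, the module only sets that attribute on the thread that first imports it, which would explain why the first request works and later ones (handled by other threads) fail. A commonly used workaround, sketched here under that assumption, is to add the missing attribute before predicting. ensure_thread_children is a hypothetical helper name.

```python
import threading
import weakref


def ensure_thread_children(thread=None):
    """Give a thread the `_children` mapping that multiprocessing.dummy expects.

    Workaround sketch for the AttributeError in the question; it does nothing
    if the attribute is already present.
    """
    if thread is None:
        thread = threading.current_thread()
    if not hasattr(thread, '_children'):
        thread._children = weakref.WeakKeyDictionary()
    return thread
```

Calling ensure_thread_children() at the top of the view, before predict_proba, should be enough on affected versions.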