4
votes

I'm running a scikit-learn (version 0.15.2) Random Forest with Python 3.4 on Windows 7 64-bit. I have this very simple model:

import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Load the comma-separated training data
Data = np.genfromtxt('C:/Data/Tests/Train.txt', delimiter=',')

print("nrows =", Data.shape[0], "ncols =", Data.shape[1])
X = np.float32(Data[:, 1:])  # features: all columns after the first
Y = np.int16(Data[:, 0])     # labels: the first column
RF = RandomForestClassifier(n_estimators=1000)
RF.fit(X, Y)

The X dataset has about 30,000 x 500 elements in the following format:

139.2398242257808,310.7242684642465,...

Even with no parallel processing, the memory usage creeps up to 16 GB eventually! I'm wondering why there is so much memory usage.

I know this has been asked before, some time ago, but that was before version 0.15.2...

Any suggestions?

JAB: Your n_estimators parameter is very high and you are not controlling the depth of the trees.
Chinook: Thank you. What ranges would you recommend for the tree depth and n_estimators for multi-class classification with ~100 classes?
JAB: Have you tried reducing the number of estimators and seeing if it reduces the memory used? It is just a guess, I'm afraid; I wondered if you were creating lots of very deep trees.
Chinook: Yes, reducing it does lower the memory usage. And with the tree depth not set, I guess it was growing without bound. I have a big dataset with many features and classes, so I thought I needed many trees. I need to experiment more...
JAB: You can also try to control the depth of the trees by increasing the number of data points needed at each split (min_samples_split; I think the default is two). Increasing this might allow you to increase the number of estimators.
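As an aside, a quick way to check whether the trees really are growing very deep is to inspect the fitted forest. This is just a sketch; it assumes RF has already been fit as in the question and uses the standard estimators_ and tree_ attributes of scikit-learn forests:

# Sketch: see how deep the fitted trees actually grew
# (assumes RF has been fit as in the question)
print("deepest tree:", max(t.tree_.max_depth for t in RF.estimators_))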

2 Answers

2
votes

Unfortunately, memory consumption is linear in the number of classes. Since you have hundreds of them and quite a decent number of samples, it is not surprising that memory blows up. Solutions include controlling the size of the trees (max_depth, min_samples_leaf, ...), their number (n_estimators), or reducing the number of classes in your problem, if that is possible.
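To see why, each node of a fitted tree stores a float64 value array with one entry per class, so a back-of-the-envelope lower bound on the forest's memory is total nodes x classes x 8 bytes. A sketch, assuming RF has been fit as in the question:

# Rough lower bound on memory held by the per-node class-count arrays
n_classes = len(RF.classes_)
total_nodes = sum(t.tree_.node_count for t in RF.estimators_)
approx_gb = total_nodes * n_classes * 8 / 1024.0**3  # float64 = 8 bytes
print("approx value-array memory: %.1f GB" % approx_gb)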

1
vote

Try reducing the number of trees by setting a smaller n_estimators parameter. You can then control the tree depth using max_depth or min_samples_split and trade off depth for an increased number of estimators.
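A minimal sketch of that trade-off, reusing X, Y and the import from the question; the specific values below are hypothetical starting points, not tuned recommendations:

# Sketch: fewer, size-limited trees (all parameter values are illustrative)
RF = RandomForestClassifier(
    n_estimators=200,      # down from 1000
    max_depth=20,          # cap each tree's depth
    min_samples_split=10,  # default is 2; larger values mean shallower trees
)
RF.fit(X, Y)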