4
votes

It would be good to get some tips on tuning Apache Spark for Random Forest classification.
Currently, we have a model that looks like this (a rough sketch of how we train it follows the list):

  • featureSubsetStrategy all
  • impurity gini
  • maxBins 32
  • maxDepth 11
  • numberOfClasses 2
  • numberOfTrees 100
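
Roughly, the training call looks like this (a minimal sketch with the settings above, using the Spark 1.5 MLlib API; the data RDD and variable names are placeholders):

    import org.apache.spark.mllib.tree.RandomForest
    import org.apache.spark.mllib.regression.LabeledPoint
    import org.apache.spark.rdd.RDD

    // trainingData: RDD[LabeledPoint] with our ~246 numeric attributes per point
    def train(trainingData: RDD[LabeledPoint]) = {
      val numClasses = 2
      // no categorical features (we removed them earlier)
      val categoricalFeaturesInfo = Map[Int, Int]()
      val numTrees = 100
      val featureSubsetStrategy = "all"
      val impurity = "gini"
      val maxDepth = 11
      val maxBins = 32

      RandomForest.trainClassifier(trainingData, numClasses, categoricalFeaturesInfo,
        numTrees, featureSubsetStrategy, impurity, maxDepth, maxBins)
    }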

We are running Spark 1.5.1 as a standalone cluster.

  • 1 Master and 2 Worker nodes.
  • The amount of RAM is 32GB on each node with 4 Cores.
  • The classification takes 440ms.

When we increase the number of trees to 500, prediction already takes 8 seconds. We tried to reduce the depth, but then the error rate is higher. We have around 246 attributes.

We are probably doing something wrong. Any ideas on how we could improve the performance?

1
I'm not familiar with Spark, but maybe it's a memory-related problem (e.g. swapping)? Your runtime seems to increase nonlinearly. – Ibraim Ganiev
Is it just the prediction that is slow, or also the training? Are you trying to predict one example or many? – David Maust
The prediction is very slow; that's the main problem. The training was slow as well, but it sped up after we removed the categorical features. – Alex Ratnikov
Did you solve this issue? – Daniel Nitzan

1 Answer

0
votes

Increasing the number of decision trees will definitely increase the prediction time, because each instance has to traverse all of the trees. But reducing it hurts prediction accuracy, so you have to vary this parameter (the number of trees) and find an optimal value; that is why it is called a hyper-parameter. Hyper-parameters depend heavily on the nature of your data and attributes, and you may need to vary the other hyper-parameters as well, one by one, to approach a global optimum.
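
One way to do that (a sketch, assuming your data is already split into training and test RDDs of LabeledPoint; the candidate values and names are placeholders) is to train with a few tree counts and compare the test error and prediction time for each:

    import org.apache.spark.mllib.tree.RandomForest
    import org.apache.spark.mllib.regression.LabeledPoint
    import org.apache.spark.rdd.RDD

    def tuneNumTrees(training: RDD[LabeledPoint], test: RDD[LabeledPoint]): Unit = {
      for (numTrees <- Seq(50, 100, 200, 500)) {
        val model = RandomForest.trainClassifier(
          training, 2, Map[Int, Int](), numTrees, "all", "gini", 11, 32)

        // Fraction of misclassified test points for this tree count
        val testErr = test.map { p =>
          if (model.predict(p.features) != p.label) 1.0 else 0.0
        }.mean()

        // Wall-clock time to score the whole test set
        val start = System.nanoTime()
        test.map(p => model.predict(p.features)).count()
        val elapsedMs = (System.nanoTime() - start) / 1e6

        println(s"numTrees=$numTrees testError=$testErr predictTimeMs=$elapsedMs")
      }
    }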

Also, when you say prediction time, are you including the time to load the model as well? If so, the model-loading time should not be counted as prediction time; it is only a one-off overhead for loading the model and preparing the application for prediction.
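
A rough way to separate the two (a sketch, assuming the model was previously saved with model.save; the path and names are placeholders):

    import org.apache.spark.SparkContext
    import org.apache.spark.mllib.linalg.Vector
    import org.apache.spark.mllib.tree.model.RandomForestModel

    def timeLoadAndPredict(sc: SparkContext, modelPath: String, features: Vector): Unit = {
      // One-off cost: loading the saved model from disk
      val t0 = System.nanoTime()
      val model = RandomForestModel.load(sc, modelPath)
      val loadMs = (System.nanoTime() - t0) / 1e6

      // Actual prediction cost for a single instance
      val t1 = System.nanoTime()
      val prediction = model.predict(features)
      val predictMs = (System.nanoTime() - t1) / 1e6

      println(s"load=${loadMs}ms predict=${predictMs}ms prediction=$prediction")
    }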