It would be good to get some tips on tuning Apache Spark for Random Forest classification.
Currently, we have a model with the following parameters (a minimal training sketch follows the list):
- featureSubsetStrategy all
- impurity gini
- maxBins 32
- maxDepth 11
- numClasses 2
- numTrees 100
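For reference, here is a minimal sketch of the training call, assuming the MLlib RDD-based API (`RandomForest.trainClassifier`); `trainingData` and the seed are placeholders:

```scala
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.tree.RandomForest
import org.apache.spark.mllib.tree.model.RandomForestModel
import org.apache.spark.rdd.RDD

// trainingData is a placeholder RDD of labeled points with ~246 features each
def trainModel(trainingData: RDD[LabeledPoint]): RandomForestModel =
  RandomForest.trainClassifier(
    trainingData,
    numClasses = 2,
    categoricalFeaturesInfo = Map[Int, Int](), // all features treated as continuous
    numTrees = 100,
    featureSubsetStrategy = "all",             // consider every feature at each split
    impurity = "gini",
    maxDepth = 11,
    maxBins = 32,
    seed = 12345                               // placeholder seed
  )
```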
We are running Spark 1.5.1 as a standalone cluster (a configuration sketch follows the list):
- 1 Master and 2 Worker nodes.
- Each node has 32 GB of RAM and 4 cores.
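For completeness, the context is configured roughly like this (a sketch; the master URL, app name, and memory settings are placeholders reflecting the hardware above):

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Placeholder values reflecting the cluster described above
val conf = new SparkConf()
  .setAppName("RandomForestClassification")   // hypothetical app name
  .setMaster("spark://master-host:7077")      // placeholder standalone master URL
  .set("spark.executor.memory", "24g")        // leaves headroom out of 32 GB per node
  .set("spark.cores.max", "8")                // 2 workers x 4 cores
val sc = new SparkContext(conf)
```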
With 100 trees, the classification takes about 440 ms. When we increase the number of trees to 500, it already takes about 8 seconds. We tried reducing the depth, but then the error rate is higher. We have around 246 features.
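To show what we measure, here is a rough sketch of how a single prediction is timed (the zero-filled vector is a stand-in for a real sample; `model.predict(Vector)` runs locally on the driver, without launching a Spark job):

```scala
import org.apache.spark.mllib.linalg.Vectors

// model comes from the training sketch above; the zero-filled vector is a
// placeholder for a real 246-feature sample
val model = trainModel(trainingData)
val features = Vectors.dense(new Array[Double](246))

val start = System.nanoTime()
val predicted = model.predict(features) // local prediction on the driver
val elapsedMs = (System.nanoTime() - start) / 1e6
println(s"Predicted class $predicted in $elapsedMs ms")
```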
We are probably doing something wrong. Any ideas on how we could improve the performance?