
I have a Java application that trains an MLlib Random Forest (org.apache.spark.mllib.tree.RandomForest) on a training set with 200K samples. I've noticed that only one CPU core is utilised during training. Given that a Random Forest is an ensemble of N Decision Trees, one would think that the trees could be trained in parallel, thus utilising all available cores. Is there a configuration option, API call, or anything else that can enable parallel training of the Decision Trees?

If you see only one active thread it is either your code or configuration, not org.apache.spark.mllib.tree.RandomForest. – user6022341
@LostInOverflow wiki answer? – eliasah
@eliasah Let's give Morten Jorgensen time to update this question. – user6022341

1 Answer


I found the answer to this. The issue was with how I set up the Spark configuration: SparkConf.setMaster("local") runs Spark locally with a single worker thread. I changed this to SparkConf.setMaster("local[16]") to use 16 threads, as per the javadoc:

http://spark.apache.org/docs/latest/api/java/org/apache/spark/SparkConf.html#setMaster(java.lang.String)
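For reference, a minimal sketch of the fix (the class and application names here are placeholders, not from my actual code):

```java
import org.apache.spark.SparkConf;

public class LocalMasterExample {
    public static void main(String[] args) {
        // Master URL controls local parallelism:
        //   "local"     -> one worker thread (the behaviour I was seeing)
        //   "local[16]" -> 16 worker threads
        //   "local[*]"  -> one thread per logical core on the machine
        SparkConf conf = new SparkConf()
                .setAppName("RandomForestTraining")
                .setMaster("local[16]");
        // Pass conf to a JavaSparkContext as usual and train from there.
    }
}
```

The "local[*]" form is handy if you don't want to hard-code the core count.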

Now my training is running far quicker, and an Amazon datacentre in Virginia is slightly hotter :)

A typical case of RTFM, but in my defence this use of setMaster() seems a bit hacky. A cleaner design would be a separate method for setting the number of local threads/cores to use.