
I have a Java application that trains an MLlib Random Forest (org.apache.spark.mllib.tree.RandomForest) on a training set with 200K samples. I've noticed that only one CPU core is utilised during training. Given that a Random Forest is an ensemble of N Decision Trees, one would think that the trees could be trained in parallel, thus utilising all available cores. Is there a configuration option, API call, or anything else that can enable parallel training of the Decision Trees?

If you see only one active thread it is either your code or configuration, not org.apache.spark.mllib.tree.RandomForest. – user6022341
@LostInOverflow wiki answer? – eliasah
@eliasah Let's give Morten Jorgensen time to update this question. – user6022341

1 Answer


I found the answer to this. The issue was with how I set up the Spark configuration: SparkConf.setMaster("local") runs Spark locally with a single worker thread. I changed this to SparkConf.setMaster("local[16]") to use 16 threads, as per the javadoc:

http://spark.apache.org/docs/latest/api/java/org/apache/spark/SparkConf.html#setMaster(java.lang.String)
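For reference, a minimal sketch of the fix (the class and application names here are placeholders, not from my actual code):

```java
import org.apache.spark.SparkConf;

public class LocalMasterExample {
    public static void main(String[] args) {
        // Master URL controls local parallelism:
        //   "local"     -> one worker thread (the behaviour I was seeing)
        //   "local[16]" -> 16 worker threads
        //   "local[*]"  -> one thread per logical core on the machine
        SparkConf conf = new SparkConf()
                .setAppName("RandomForestTraining")
                .setMaster("local[16]");
        // Pass conf to a JavaSparkContext as usual and train from there.
    }
}
```

The "local[*]" form is handy if you don't want to hard-code the core count.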

Now my training is running far quicker, and an Amazon datacentre in Virginia is slightly hotter :)

A typical case of RTFM, but in my defence this use of setMaster() seems a bit hacky. A cleaner design would be a separate method for setting the number of local threads/cores to use.