I have a Java application that trains an MLlib Random Forest (org.apache.spark.mllib.tree.RandomForest) on a training set of 200K samples. I've noticed that only one CPU core is utilised during training. Given that a Random Forest is an ensemble of N Decision Trees, one would think the trees could be trained in parallel, utilising all available cores. Is there a configuration option, API call, or anything else that enables parallel training of the Decision Trees?
1 Answer
0 votes
I found the answer to this. The issue was how I set up the Spark configuration: SparkConf.setMaster("local") runs Spark locally with a single worker thread. I changed it to SparkConf.setMaster("local[16]") to use 16 threads, as described in the javadoc for setMaster() ("local[N]" runs Spark locally with N worker threads; "local[*]" uses as many threads as there are cores).
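For illustration, here's a minimal sketch of the fix in context. The app name, the LibSVM file path, and all the tree parameters are placeholders I've made up; the only essential part is the "local[16]" master URL:

```java
import java.util.HashMap;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.mllib.regression.LabeledPoint;
import org.apache.spark.mllib.tree.RandomForest;
import org.apache.spark.mllib.tree.model.RandomForestModel;
import org.apache.spark.mllib.util.MLUtils;

public class ForestTraining {
    public static void main(String[] args) {
        // "local[16]" = run Spark locally with 16 worker threads;
        // "local[*]" would use one thread per available core.
        SparkConf conf = new SparkConf()
                .setAppName("ForestTraining")
                .setMaster("local[16]");
        JavaSparkContext sc = new JavaSparkContext(conf);

        // Hypothetical input path; any JavaRDD<LabeledPoint> will do.
        JavaRDD<LabeledPoint> training = MLUtils
                .loadLibSVMFile(sc.sc(), "data/training.libsvm")
                .toJavaRDD();

        RandomForestModel model = RandomForest.trainClassifier(
                training,
                2,                                // numClasses
                new HashMap<Integer, Integer>(),  // categoricalFeaturesInfo (none)
                100,                              // numTrees
                "auto",                           // featureSubsetStrategy
                "gini",                           // impurity
                5,                                // maxDepth
                32,                               // maxBins
                12345);                           // seed

        System.out.println("Trained " + model.numTrees() + " trees");
        sc.stop();
    }
}
```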
Now my training is running far quicker, and an Amazon datacentre in Virginia is slightly hotter :)
A typical case of RTFM, but in my defence this use of setMaster() seems a bit hacky to me. A better design would be to add a separate method for setting the number of local threads/cores to use.