0
votes

I am trying to build a random forest on a data set with 120k rows and 518 columns. I have two questions:

1. I want to see the progress and logs of building the forest. Is the verbose option deprecated in the randomForest function?
2. How can I increase the speed? Right now it takes more than 6 hours to build a random forest with 1000 trees.

H2O cluster is initialized with below settings:

hadoop jar h2odriver.jar -Dmapreduce.job.queuename=devclinical -output temp3p -nodes 20 -nthreads -1 -mapperXmx 32g

h2o.init(ip = h2o_ip, port = h2o_port, startH2O = FALSE, nthreads = -1, max_mem_size = "64G", min_mem_size = "4G")

2 Answers

0
votes

Depending on the congestion of your network and the busyness of your Hadoop nodes, the job may finish faster with fewer nodes. For example, if 1 of the 20 nodes you requested is totally slammed by some other job, that node may lag, and its work is not rebalanced to the other nodes.

A good way to see what is going on is to connect to H2O Flow in a browser and run the WaterMeter. This will show you CPU activity in your cluster.

You can compare the activity before you start your RF and after you start your RF.

If the nodes are extremely busy even before you start your RF, then you may be out of luck and just have to wait. If the nodes are not busy at all even after you start your RF, then the network communication overhead may be too high and fewer nodes would be better.
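If you prefer to check this from R rather than the Flow UI, here is a minimal sketch (assuming the standard h2o R package is attached and you are already connected to the cluster, as in the h2o.init call above):

library(h2o)

# Per-node status: health flag, free memory, CPU count, and current system load
h2o.clusterStatus()

# Run this once before you kick off the RF and again afterwards,
# then compare the load across the 20 nodes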

You'll also want to look at the H2O logs to see how the dataset got parsed, datatype-wise, and the speed at which individual trees are built. And if your response column is categorical and you're doing multinomial classification, each tree is really N trees, where N is the number of levels in the response column.
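A rough sketch of how to check both from R (h2o.describe, h2o.nlevels, and h2o.downloadAllLogs are standard h2o R functions; train and response here are placeholders for your frame and response column):

# How did each column get parsed? (int/real/enum, cardinality, missing counts)
h2o.describe(train)

# If the response is an enum with many levels, each "tree" is really N trees
h2o.nlevels(train$response)

# Pull the full H2O logs locally to check parse details and per-tree timings
h2o.downloadAllLogs(dirname = "h2o_logs")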

[ Unfortunately, the "it's too slow" complaint is way too generic to say much more. ]

0
votes

That sounds like a long time to train a Random Forest on a dataset of only 120k rows x 518 columns. As Tom said above, it might have to do with congestion on your Hadoop cluster, and possibly the cluster is way too big for this task. You should be able to train on a dataset of that size on a single machine (no multi-node cluster necessary).

If possible, try training the model on your laptop for a comparison. If there is nothing you can do to improve the Hadoop environment, this may be a better option for training.
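As a rough sketch of that comparison (the file path and response column name are placeholders; h2o.init, h2o.importFile, and h2o.randomForest are the standard h2o R calls):

library(h2o)
h2o.init(nthreads = -1, max_mem_size = "16G")  # local, single-node H2O

train <- h2o.importFile("path/to/your/data.csv")  # placeholder path
y <- "response"                                   # placeholder response column
x <- setdiff(names(train), y)

t0 <- Sys.time()
rf <- h2o.randomForest(x = x, y = y, training_frame = train, ntrees = 1000)
Sys.time() - t0  # compare this against the 6+ hours on the 20-node cluster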

For your other question about a verbose option -- I don't remember there ever being such an option in H2O's Random Forest. You can view the progress of models as they build in H2O Flow, the GUI. When you click on a model to view it, there is a "Refresh" button that will let you check on the progress of the model as it trains.
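If you're driving H2O from R rather than Flow, the client also prints a progress bar by default while the model trains; a small sketch (h2o.show_progress and h2o.no_progress are standard h2o R helpers; x, y, and train are placeholders for your predictors, response, and training frame):

h2o.show_progress()  # progress bar on (the default); prints % complete as trees are built
# h2o.no_progress()  # turn it off if it clutters batch-job logs

rf <- h2o.randomForest(x = x, y = y, training_frame = train, ntrees = 1000)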