0
votes

I am trying to build a random forest on a data set with 120k rows and 518 columns. I have two questions:

1. I want to see the progress and logs of building the forest. Is the verbose option deprecated in the randomForest function?
2. How can I increase the speed? Right now it takes more than 6 hours to build a random forest with 1000 trees.

H2O cluster is initialized with below settings:

hadoop jar h2odriver.jar -Dmapreduce.job.queuename=devclinical -output temp3p -nodes 20 -nthreads -1 -mapperXmx 32g

h2o.init(ip = h2o_ip, port = h2o_port, startH2O = FALSE, nthreads = -1, max_mem_size = "64G", min_mem_size = "4G")

2 Answers

0
votes

Depending on the congestion of your network and the busyness of your Hadoop nodes, the job may finish faster with fewer nodes. For example, if 1 of the 20 nodes you requested is totally slammed by some other job, that node may lag, and its work is not rebalanced to the other nodes.

A good way to see what is going on is to connect to H2O Flow in a browser and run the WaterMeter. This will show you CPU activity in your cluster.

You can compare the activity before you start your RF and after you start your RF.

If the nodes are extremely busy even before you start your RF, then you may be out of luck and just have to wait. If the nodes are not busy at all even after you start your RF, then the network communication overhead may be too high and fewer nodes would be better.
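If you prefer to check this from R rather than the Flow UI, here is a minimal sketch (assuming the standard h2o R package is attached and you are already connected to the cluster, as in the h2o.init call above):

library(h2o)

# Per-node status: health flag, free memory, CPU count, and current system load
h2o.clusterStatus()

# Run this once before you kick off the RF and again afterwards,
# then compare the load across the 20 nodes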

You'll also want to look at the H2O logs to see how the dataset got parsed, datatype-wise, and the speed at which individual trees are built. And if your response column is categorical and you're doing multinomial classification, each tree is really N trees, where N is the number of levels in the response column.
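A rough sketch of how to check both from R (h2o.describe, h2o.nlevels, and h2o.downloadAllLogs are standard h2o R functions; train and response here are placeholders for your frame and response column):

# How did each column get parsed? (int/real/enum, cardinality, missing counts)
h2o.describe(train)

# If the response is an enum with many levels, each "tree" is really N trees
h2o.nlevels(train$response)

# Pull the full H2O logs locally to check parse details and per-tree timings
h2o.downloadAllLogs(dirname = "h2o_logs")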

[ Unfortunately, the "it's too slow" complaint is way too generic to say much more. ]

0
votes

That sounds like a long time to train a Random Forest on a dataset of only 120k rows x 518 columns. As Tom said above, it might have to do with congestion on your Hadoop cluster, and possibly the cluster is way too big for this task. You should be able to train on a dataset of that size on a single machine (no multi-node cluster necessary).

If possible, try training the model on your laptop for a comparison. If there is nothing you can do to improve the Hadoop environment, this may be a better option for training.
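As a rough sketch of that comparison (the file path and response column name are placeholders; h2o.init, h2o.importFile, and h2o.randomForest are the standard h2o R calls):

library(h2o)
h2o.init(nthreads = -1, max_mem_size = "16G")  # local, single-node H2O

train <- h2o.importFile("path/to/your/data.csv")  # placeholder path
y <- "response"                                   # placeholder response column
x <- setdiff(names(train), y)

t0 <- Sys.time()
rf <- h2o.randomForest(x = x, y = y, training_frame = train, ntrees = 1000)
Sys.time() - t0  # compare this against the 6+ hours on the 20-node cluster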

For your other question about a verbose option -- I don't remember there ever being such an option in H2O's Random Forest. You can view the progress of models as they build in H2O Flow, the GUI. When you click on a model to view it, there is a "Refresh" button that will let you check on the progress of the model as it trains.
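If you're driving H2O from R rather than Flow, the client also prints a progress bar by default while the model trains; a small sketch (h2o.show_progress and h2o.no_progress are standard h2o R helpers; x, y, and train are placeholders for your predictors, response, and training frame):

h2o.show_progress()  # progress bar on (the default); prints % complete as trees are built
# h2o.no_progress()  # turn it off if it clutters batch-job logs

rf <- h2o.randomForest(x = x, y = y, training_frame = train, ntrees = 1000)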