I'm iteratively training cross-validated deep learning models (nfolds=4) for feature selection with H2O from R. Currently, I have only 2 layers (i.e. not deep), with between 8 and 50 neurons per layer. There are only 323 inputs and 12 output classes.
Training one model takes around 40 seconds on average on my Intel 4770K (32 GB RAM). During training, H2O is able to max out all CPU cores.
To try to speed up the training, I set up an EC2 instance in the Amazon cloud. I tried the largest compute instance (c4.8xlarge), but the speedup was minimal: it took around 24 seconds to train one model with the same settings. I therefore suspect there's something I've overlooked. I started H2O like this:
localH2O <- h2o.init(ip = 'localhost', port = 54321, max_mem_size = '24G', nthreads=-1)
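To rule out the obvious, here is a minimal sketch of how the core count could be checked after startup. It assumes the h2o R package is installed and that a local instance can be started; parallel::detectCores() is only used as a reference for what the OS reports, and the exact fields printed by the cluster calls may differ between H2O versions.

```r
library(h2o)
library(parallel)

# Start (or connect to) a local H2O instance, asking for every available core.
localH2O <- h2o.init(ip = "localhost", port = 54321,
                     max_mem_size = "24G", nthreads = -1)

# Logical cores the operating system reports (36 would be expected on a c4.8xlarge).
os_cores <- detectCores()
cat("Cores reported by the OS:", os_cores, "\n")

# Summary of the running cluster, including the cores H2O is allowed to use.
h2o.clusterInfo()

# Per-node details (memory, CPUs); useful for spotting a node that sees fewer cores.
h2o.clusterStatus()
```

If the core count H2O prints is lower than what the OS reports, the nthreads setting is not taking effect; if the two match, the bottleneck is more likely elsewhere.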
To compare the processors: the 4770K scores 10163 on the CPU benchmark, while the Intel Xeon E5-2666 v3 scores 24804 (36 vCPUs).
This speedup is disappointing, to say the least, and is not worth the extra work of installing and setting everything up in the Amazon cloud while paying over $2/hour.
Is there something else that needs to be done to get all cores working, besides setting nthreads=-1?
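To put the numbers side by side, here is a quick back-of-the-envelope calculation in R using only the figures quoted in this post. The variable names are just for illustration, and the "efficiency" ratio is a rough indicator that assumes training time would scale with the benchmark score, which may not hold for this workload.

```r
# Figures quoted in this post.
t_local     <- 40     # seconds per model on the 4770K
t_ec2       <- 24     # seconds per model on the c4.8xlarge
bench_local <- 10163  # CPU benchmark score, 4770K
bench_ec2   <- 24804  # CPU benchmark score, Xeon E5-2666 v3

# Speedup actually observed versus the speedup the benchmark scores suggest.
observed_speedup <- t_local / t_ec2          # about 1.67x
benchmark_ratio  <- bench_ec2 / bench_local  # about 2.44x

# Fraction of the benchmark-implied speedup that was achieved.
efficiency <- observed_speedup / benchmark_ratio

cat(sprintf("observed: %.2fx, benchmark ratio: %.2fx, efficiency: %.2f\n",
            observed_speedup, benchmark_ratio, efficiency))
```

So the EC2 instance delivers well under the speedup its benchmark score would suggest.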
Do I need to start making several clusters in order to get the training time down, or should I just start on a new deep learning library that supports GPUs?