3 votes

I am running Google Cloud Machine Learning (beta) and am using the hyperparameter tuning (HyperTune) setup with TensorFlow.

In some of the sub-runs (trials) of hyperparameter tuning, the loss becomes NaN, which crashes the computation and in turn stops the whole hyperparameter tuning job.

    Error reported to Coordinator: <class 'tensorflow.python.framework.errors.InvalidArgumentError'>,
    Nan in summary histogram for: softmax_linear/HistogramSummary
    [[Node: softmax_linear/HistogramSummary = HistogramSummary[T=DT_FLOAT,
    _device="/job:master/replica:0/task:0/cpu:0"]
    (softmax_linear/HistogramSummary/tag, softmax_linear/softmax_linear)]]
    Caused by op u'softmax_linear/HistogramSummary', defined at: File
    "/usr/lib/python2.7/runpy.py", line 162, in _run_module_as_main

What is the canonical way of handling these NaNs? Should I protect the loss function?

Thanks

Have you tried using tf.add_check_numerics_ops() to identify where the computation is becoming unstable? - Jeremy Lewi
Two questions: (1) is the loss fine for the first part of training and just "eventually" goes to NaN? (2) For the exact same hyperparameter settings that are failing, does the model sometimes converge, or does it always go to NaN? - rhaertel80
It sometimes goes to NaN, depending on the hyperparameters - basically, if the learning_rate is too large or if the number of features is too high. My current countermeasure is to limit the exploration range for hyperparameter tuning. - MathiasOrtner
Jeremy, I will use this function. Thanks for the pointer. - MathiasOrtner
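
For reference, here is a minimal sketch of the tf.add_check_numerics_ops() approach suggested in the comments (assuming TF 1.x graph mode and a toy two-placeholder graph, not the poster's actual model). It attaches a CheckNumerics op to every floating-point op, so the first op that produces a NaN or Inf raises an InvalidArgumentError naming the offending tensor:

    import tensorflow as tf

    logits = tf.placeholder(tf.float32, [None, 1])
    labels = tf.placeholder(tf.float32, [None, 1])

    # Hand-rolled loss: sigmoid() can underflow to 0, so log() can return -inf.
    loss = -tf.reduce_mean(labels * tf.log(tf.sigmoid(logits)))

    # Must be called after the graph is built; returns one grouped check op.
    check_op = tf.add_check_numerics_ops()

    with tf.Session() as sess:
        # A logit of -1000 makes sigmoid() underflow to 0 in float32. Running the
        # check op alongside the loss raises InvalidArgumentError at the Log op,
        # pointing at exactly where the computation became unstable.
        sess.run([loss, check_op],
                 feed_dict={logits: [[-1000.0]], labels: [[1.0]]})

The per-op checks add overhead, so this is more of a debugging aid than something to leave enabled for every tuning trial.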

1 Answer

1 vote

You should protect the loss function by checking for NaNs. Any crash or exception thrown by the program is treated by Cloud ML as a failure of that trial, and if enough trials fail, the entire job is marked as failed.
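
As a complementary, preventive measure (a hedged sketch with assumed stand-ins for the softmax_linear output, not the poster's code), the loss itself can be made less likely to produce NaNs by using TensorFlow's fused, numerically stable cross-entropy op instead of a hand-rolled log(softmax(...)), and by clipping anything that still has to pass through a log():

    import tensorflow as tf

    labels = tf.placeholder(tf.int64, [None])
    logits = tf.placeholder(tf.float32, [None, 10])  # stand-in for the softmax_linear output

    # The fused op computes log-softmax internally in a numerically stable way.
    loss = tf.reduce_mean(
        tf.nn.sparse_softmax_cross_entropy_with_logits(labels=labels, logits=logits))

    # If some probability still has to go through log() directly, clip it away from 0.
    safe_probs = tf.clip_by_value(tf.nn.softmax(logits), 1e-10, 1.0)
    log_probs = tf.log(safe_probs)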

If the trial instead exits cleanly without setting any hyperparameter summaries, it is considered Infeasible: similar hyperparameter values will be less likely to be tried again, but the trial is not treated as an error.
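
A minimal sketch of that check-and-exit-cleanly pattern (an assumed stand-alone TF 1.x trainer, not the poster's code or a Cloud ML API): fetch the loss value every step, and if it is NaN or Inf, stop without raising an exception and exit with status 0 so the trial ends up Infeasible rather than failed:

    import math
    import sys

    import numpy as np
    import tensorflow as tf

    # Toy stand-in model; the real trainer would build its own graph.
    x = tf.placeholder(tf.float32, [None, 4])
    y = tf.placeholder(tf.int64, [None])
    w = tf.get_variable("w", [4, 3])
    loss = tf.reduce_mean(tf.nn.sparse_softmax_cross_entropy_with_logits(
        labels=y, logits=tf.matmul(x, w)))
    train_op = tf.train.GradientDescentOptimizer(0.5).minimize(loss)

    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())
        data = np.random.rand(32, 4).astype(np.float32)
        targets = np.random.randint(0, 3, 32)
        for step in range(1000):
            _, loss_value = sess.run([train_op, loss], feed_dict={x: data, y: targets})
            if math.isnan(loss_value) or math.isinf(loss_value):
                print("Loss is %r at step %d; ending this trial cleanly." % (loss_value, step))
                # No exception and no hyperparameter summary written: the trial
                # can be marked Infeasible instead of failing the whole job.
                sys.exit(0)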