NaNs generated after training a neural network for some time using TensorFlow

Question

I have been facing this problem for a few days and cannot figure out where I am making a mistake. My code is lengthy, so I cannot reproduce all of it here.

Here are the results in the first case:

Accuracy: 0.1071 Error: 1.45003
Accuracy: 0.5149 Error: 0.259084
Accuracy: 0.7199 Error: 0.197301
Accuracy: 0.7934 Error: 0.138881
Accuracy: 0.8137 Error: 0.136115
Accuracy: 0.8501 Error: 0.15382
Accuracy: 0.8642 Error: 0.100813
Accuracy: 0.8761 Error: 0.0882854
Accuracy: 0.882 Error: 0.0874575
Accuracy: 0.8861 Error: 0.0629579
Accuracy: 0.8912 Error: 0.101606
Accuracy: 0.8939 Error: 0.0744626
Accuracy: 0.8975 Error: 0.0775732
Accuracy: 0.8957 Error: 0.0909776
Accuracy: 0.9002 Error: 0.0799101
Accuracy: 0.9034 Error: 0.0621196
Accuracy: 0.9004 Error: 0.0752576
Accuracy: 0.9068 Error: 0.0531508
Accuracy: 0.905 Error: 0.0699344
Accuracy: 0.8941 Error: nan
Accuracy: 0.893 Error: nan
Accuracy: 0.893 Error: nan

I have tried various things but failed to figure out where I am making a mistake.

1) Changed the cross-entropy calculation to several variants:

self._error = -tf.reduce_sum(y*pred + 1e-9)
self._error = -tf.reduce_sum(y*pred)
self._error = tf.reduce_mean(tf.nn.sigmoid_cross_entropy_with_logits(logits=pred, labels=y))
self._error = tf.reduce_mean(-tf.reduce_sum(y * tf.log(pred+1e-8),reduction_indices=1))

out = tf.nn.softmax_cross_entropy_with_logits(logits = pred, labels=y)
self._error= tf.reduce_mean(out)
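
For reference alongside the variants above, here is a minimal pure-Python sketch (not the code from this question) of computing the cross-entropy directly from raw logits with the log-sum-exp trick, so that log(0) is never formed. The helper names log_softmax and cross_entropy_from_logits and the sample logits are made up for illustration. The sketch assumes pred holds unnormalized logits and y is one-hot.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax using the log-sum-exp trick."""
    m = max(logits)  # subtracting the max keeps exp() from overflowing
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    return [z - log_sum for z in logits]

def cross_entropy_from_logits(logits, one_hot):
    """Cross-entropy from raw logits; no probability is ever passed to log()."""
    return -sum(t * lp for t, lp in zip(one_hot, log_softmax(logits)))

# A term of the form 0 * log(0) is 0 * (-inf) in floating point, which is NaN.
nan_term = 0.0 * float("-inf")

# Extreme (hypothetical) logits that would saturate a plain softmax.
logits = [1000.0, 0.0, -1000.0]
loss_true_class = cross_entropy_from_logits(logits, [1.0, 0.0, 0.0])
loss_wrong_class = cross_entropy_from_logits(logits, [0.0, 0.0, 1.0])

print(nan_term, loss_true_class, loss_wrong_class)
```

Even with these extreme logits both losses stay finite, whereas the 0 * log(0) term evaluates to NaN.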

I have tried several optimizers: SGD, Adam, Adagrad, and RMSProp.

I have used both the default optimizer settings provided by TensorFlow and manually chosen parameters. I have even tried learning rates as small as 0.00001.

Bias:
I have tried both 1.0 and 0.0

Weights:
Initialized with tf.truncated_normal_initializer(stddev=0.1, dtype = tf.float32)

Network:
FC784 - FC256 - FC128 - FC10
I have tried different variants of it also.

Activation Function:
ReLU, tanh, and leaky ReLU implemented as tf.maximum(input, 0.1*input)

Data:
MNIST dataset, normalized by dividing by 255. The dataset is loaded from Keras.

I know this question has been asked in various Stack Overflow questions. I have tried all the methods suggested there, and as far as I can tell none of them helped.

Sometimes devices return None for some reason. That is normal, but how do you handle it? NaN can mean "not ready yet", "resource is busy", "value overflowed", etc. Every resource access has a delay; for example, you get a delay of 1/CPU_CLOCK on a computer. Check your device capabilities... 0/None = NaN — dsgdfg
I have run the code on 3-4 systems, using both CPU and GPU, and the result is the same. — Prakash Vanapalli

Jimmy Tran · Accepted Answer · 2017-03-24T07:53:10

From the information above it's hard to tell what went wrong. Yes, debugging neural networks can be very tedious. Luckily, the TensorFlow Debugger (tfdbg) is a great tool that lets you step through the network at every iteration and inspect your weights.

Run the following command in tfdbg to get to the first NaN or inf value that shows up in the graph.

run -f has_inf_or_nan
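
To give an idea of what that filter looks for, here is a rough pure-Python sketch. It is not tfdbg's actual implementation: it simply walks a list of intermediate values in execution order and reports the first entry containing NaN or inf. The functions contains_inf_or_nan and first_bad_tensor, and the sample dump, are hypothetical.

```python
import math

def contains_inf_or_nan(values):
    """Return True if any number in a flat list is NaN or infinite."""
    return any(math.isnan(v) or math.isinf(v) for v in values)

def first_bad_tensor(named_values):
    """Return the name of the first entry with NaN/inf, or None if all are finite."""
    for name, values in named_values:
        if contains_inf_or_nan(values):
            return name
    return None

# Hypothetical dump of intermediate values, listed in execution order.
dump = [
    ("fc1/weights", [0.12, -0.05, 0.3]),
    ("softmax", [1.0, 0.0, 0.0]),
    ("log", [0.0, float("-inf"), float("-inf")]),
    ("loss", [float("nan")]),
]
culprit = first_bad_tensor(dump)
print(culprit)
```

In this made-up dump the search stops at the log step rather than at the loss, which is the kind of information that points to where the bad value originates.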
