I have implemented a neural network with 3 layers: a 784-neuron input layer, a hidden layer with 30 neurons (ReLU activation), and a 10-neuron softmax output layer. I am using the cross-entropy cost function, and no outside libraries are being used. It trains on the MNIST dataset, hence the 784 input neurons and 10 output neurons. With hyperbolic tangent as my hidden-layer activation I get about 96% accuracy, but when I switch to ReLU my activations grow very fast, which causes my weights to grow unbounded as well until everything blows up!
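For reference, here is a minimal sketch of the forward pass I'm describing (written with NumPy just for readability; my actual implementation doesn't use outside libraries, and the scaled Gaussian initialization shown here is only an assumption, not necessarily what my code does):

```python
import numpy as np

# Hypothetical 1/sqrt(fan_in) Gaussian initialization -- shown only as an
# example; my real code may initialize differently.
rng = np.random.default_rng(0)
W1 = rng.standard_normal((784, 30)) / np.sqrt(784)
b1 = np.zeros(30)
W2 = rng.standard_normal((30, 10)) / np.sqrt(30)
b2 = np.zeros(10)

def relu(z):
    return np.maximum(0.0, z)

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def forward(X):
    z1 = X @ W1 + b1      # input -> hidden pre-activation
    a1 = relu(z1)         # ReLU hidden activation (this is what keeps growing)
    z2 = a1 @ W2 + b2     # hidden -> output pre-activation
    return softmax(z2)    # class probabilities

def cross_entropy(probs, y_onehot):
    # mean negative log-likelihood over the batch
    return -np.mean(np.sum(y_onehot * np.log(probs + 1e-12), axis=1))
```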
Is this a common problem to have when using ReLU activation?
I have tried L2 regularization with minimal success. I end up having to set the learning rate a factor of ten lower than with the tanh activation, and even after adjusting the weight-decay rate accordingly, the best accuracy I have gotten is about 90%. In the end the weight decay is still outpaced by the updates to certain weights in the network, which leads to an explosion. It seems everyone else just replaces their activation function with ReLU and gets better results, so I keep looking for bugs and validating my implementation. Is there more that goes into using ReLU as an activation function? Maybe there are problems in my implementation; can someone validate accuracy with the same neural net structure?
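To be concrete about the regularization, the update I'm applying looks roughly like the sketch below. The names eta, lam, and n_train are placeholders I'm using here for the learning rate, L2 coefficient, and training-set size, and the values are only illustrative of the lower learning rate I mentioned, not the exact numbers from my runs:

```python
# Sketch of an L2-regularized (weight-decay) SGD step, with placeholder values.
eta = 0.005        # roughly 10x lower than the learning rate that worked with tanh
lam = 5.0          # example L2 regularization strength
n_train = 50000    # MNIST training-set size

def sgd_step(W, grad_W, batch_size):
    # The decay factor shrinks W toward zero each step, while the gradient
    # term updates it; in my runs the gradient term on certain weights still
    # outpaces the decay and the weights eventually blow up.
    return (1 - eta * lam / n_train) * W - (eta / batch_size) * grad_W
```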