
Here is my scenario:
I have used the EMNIST database of capital letters of the English language.
My neural network is as follows (a rough code sketch of this setup follows the list):

  1. The input layer has 784 neurons, which are the pixel values of a 28x28 greyscale image divided by 255, so each value is in the range [0, 1].
  2. The hidden layer has 49 neurons, fully connected to the previous 784.
  3. The output layer has 9 neurons denoting the class of the image.
  4. The loss function is defined as the cross-entropy of the softmax of the output layer.
  5. All weights are initialized as random real numbers from [-1, +1].
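To make the setup concrete, here is a rough NumPy sketch of the forward pass and loss (my actual implementation is hand-written; the array names are just for illustration, and biases are omitted since I only initialize weights):

```python
import numpy as np

rng = np.random.default_rng(0)

# Weights initialized as random real numbers in [-1, +1], as described above.
W1 = rng.uniform(-1.0, 1.0, size=(784, 49))   # input -> hidden
W2 = rng.uniform(-1.0, 1.0, size=(49, 9))     # hidden -> output

def forward(x, activation=np.tanh):
    """Forward pass for one flattened 28x28 image already scaled to [0, 1]."""
    h = activation(x @ W1)            # hidden layer: 49 neurons (tanh here; relu is the alternative)
    logits = h @ W2                   # output layer: 9 class scores
    p = np.exp(logits - logits.max())
    return p / p.sum()                # softmax probabilities

def loss(p, true_class):
    """Cross-entropy of the softmax output for the correct class."""
    return -np.log(p[true_class])
```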

I then trained the network with 500 fixed samples for each class.

Simply put, I passed the 500x9 images to the train function, which uses backpropagation and runs 100 iterations, changing each weight by learning_rate * derivative_of_loss_wrt_corresponding_weight.
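In pseudocode, each of the 100 iterations does plain gradient descent over all 500x9 samples (the `backprop` helper and the `X_train`/`y_train`/`W1`/`W2` names are placeholders for my hand-written code, shown only to illustrate the update I described):

```python
learning_rate = 0.0001

for epoch in range(100):
    # backprop(...) stands in for my hand-written backpropagation, returning
    # the derivative of the loss w.r.t. each weight matrix.
    grad_W1, grad_W2 = backprop(X_train, y_train, W1, W2)

    # Every weight is changed by learning_rate * derivative_of_loss_wrt_that_weight.
    W1 -= learning_rate * grad_W1
    W2 -= learning_rate * grad_W2
```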

I found that when I use the tanh activation on the hidden neurons, the network learns faster than with ReLU at a learning rate of 0.0001.

I concluded this because accuracy on a fixed test dataset was higher for tanh than for ReLU. Also, the loss value after 100 epochs was slightly lower for tanh.

Isn't ReLU expected to perform better?


1 Answer


> Isn't ReLU expected to perform better?

In general, no. ReLU will perform better on many problems, but not all.

Furthermore, if you use an architecture and set of parameters that are optimized to perform well with one activation function, you may get worse results after swapping in a different activation function.

Often you will need to adjust the architecture and parameters like learning rate to get comparable results. This may mean changing the number of hidden nodes and/or the learning rate in your example.

One final note: In the MNIST example architectures I have seen, hidden layers with ReLU activations are typically followed by Dropout layers, whereas hidden layers with sigmoid or tanh activations are not. Try adding dropout after the hidden layer and see if that improves your results with ReLU. See the Keras MNIST example here.
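A minimal sketch of that change in Keras (layer sizes taken from your question; the 0.2 dropout rate is only an illustrative value, not a tuned one):

```python
import tensorflow as tf
from tensorflow.keras import layers

# Hidden ReLU layer followed by Dropout, as in the Keras MNIST examples.
model = tf.keras.Sequential([
    layers.Input(shape=(784,)),
    layers.Dense(49, activation="relu"),
    layers.Dropout(0.2),                       # drops 20% of hidden activations during training
    layers.Dense(9, activation="softmax"),
])

model.compile(
    optimizer=tf.keras.optimizers.SGD(learning_rate=1e-4),
    loss="categorical_crossentropy",
    metrics=["accuracy"],
)
```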