1 vote

I am writing a single Multi-Layer Perceptron (MLP) from scratch, with just an input layer, one hidden layer, and an output layer. The output layer will use the softmax activation function to produce probabilities over several mutually exclusive outputs.

In my hidden layer it does not make sense to me to use the softmax activation function too - is this correct? If so, can I just use any other non-linear activation function, such as sigmoid or tanh? Or could I even use no activation function in the hidden layer at all, and just keep the values of the hidden nodes as the linear combinations of the input nodes and the input-to-hidden weights?
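
For concreteness, here is a minimal sketch of the forward pass of the network described above, with a tanh hidden layer and a softmax output (the function names, dimensions, and the tanh choice are illustrative assumptions, not given in the question):

```python
import numpy as np

def softmax(z):
    # subtract the row-wise max for numerical stability
    e = np.exp(z - np.max(z, axis=-1, keepdims=True))
    return e / np.sum(e, axis=-1, keepdims=True)

def forward(x, W1, b1, W2, b2, hidden_activation=np.tanh):
    # hidden layer: linear combination of inputs, then a non-linearity
    h = hidden_activation(x @ W1 + b1)
    # output layer: softmax turns the scores into mutually exclusive class probabilities
    return softmax(h @ W2 + b2)

# toy sizes: 4 inputs, 5 hidden units, 3 output classes
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(4, 5)), np.zeros(5)
W2, b2 = rng.normal(size=(5, 3)), np.zeros(3)
probs = forward(rng.normal(size=(1, 4)), W1, b1, W2, b2)
print(probs, probs.sum())  # the three probabilities sum to 1
```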


2 Answers

3 votes

In my hidden layer it does not make sense to me to use the softmax activation function too - is this correct?

It is indeed correct.

If so can I just use any other non-linear activation function such as sigmoid or tanh?

You can, but most modern approaches would call for a Rectified Linear Unit (ReLU) or one of its variants (Leaky ReLU, ELU, etc.).
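
As a sketch, these common choices are one-liners in NumPy (ELU and other variants follow the same pattern):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def relu(z):
    return np.maximum(0.0, z)

def leaky_relu(z, alpha=0.01):
    # keep a small slope alpha for negative inputs instead of clamping to zero
    return np.where(z > 0, z, alpha * z)

# tanh is available directly as np.tanh
```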

Or could I even not use any activation function in the hidden layer and just keep the values of the hidden nodes as the linear combinations of the input nodes and input-to-hidden weights?

No. The non-linear activations are exactly what prevent a (possibly large) neural network from behaving just like a single linear unit. As Andrew Ng explains in his relevant Coursera lecture, "Why do you need non-linear activation functions?":

It turns out that if you use a linear activation function, or alternatively if you don't have an activation function, then no matter how many layers your neural network has, all it is doing is just computing a linear activation function, so you might as well not have any hidden layers.

The take-home message is that a linear hidden layer is more or less useless, because the composition of two linear functions is itself a linear function; unless you throw a non-linearity in there, you are not computing any more interesting functions as you go deeper in the network.
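
Here is a quick numerical illustration of that point (a sketch with made-up sizes): stacking two linear layers with no activation in between is equivalent to a single linear layer whose weights are the product of the two.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(1, 4))
W1, b1 = rng.normal(size=(4, 5)), rng.normal(size=5)
W2, b2 = rng.normal(size=(5, 3)), rng.normal(size=3)

# two stacked "layers" with no non-linearity in between
deep = (x @ W1 + b1) @ W2 + b2

# one equivalent linear layer: W = W1 W2, b = b1 W2 + b2
shallow = x @ (W1 @ W2) + (b1 @ W2 + b2)

print(np.allclose(deep, shallow))  # True
```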

Practically, the only place where you could use a linear activation function is the output layer for regression problems (this is also explained in the lecture linked above).
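
In code, the only change for such a regression output is dropping the softmax, so the last layer returns unbounded real values (again just a sketch, under the same assumptions as above):

```python
import numpy as np

def forward_regression(x, W1, b1, W2, b2):
    h = np.maximum(0.0, x @ W1 + b1)  # ReLU hidden layer
    return h @ W2 + b2                # linear output: unbounded real-valued predictions
```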

1 vote

You can use any activation function. Just test a few and go with the one that yields the best results. Don't forget to try ReLU, though; as far as I know, it is the simplest one that actually works very well.