4 votes

Hidden layers of a classifier network use sigmoid or another activation function to introduce non-linearity and squash activations into a bounded range, but does the last layer use sigmoid in conjunction with softmax?

I have a feeling it doesn't matter and the network will train either way -- but should a softmax layer be used alone, or should the sigmoid function be applied first?


1 Answer

3 votes

In general, there's no point in an additional sigmoid activation just before the softmax output layer. Since the sigmoid function is a special case of softmax, applying both would just squash the values into the [0, 1] interval twice in a row, which would produce a nearly uniform output distribution. Of course, you can still backpropagate through this, but training will be much less efficient.
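
To see the flattening effect, here is a minimal NumPy sketch (the logit values are hypothetical) comparing softmax applied directly to the raw logits versus softmax applied after a sigmoid:

```python
import numpy as np

def softmax(z):
    # Subtract the max for numerical stability before exponentiating
    e = np.exp(z - np.max(z))
    return e / e.sum()

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

logits = np.array([2.0, -1.0, 0.5, 4.0])  # hypothetical pre-activation outputs

print(softmax(logits))           # sharply peaked: roughly [0.12, 0.01, 0.03, 0.85]
print(softmax(sigmoid(logits)))  # sigmoid first squashes logits into (0, 1),
                                 # so the softmax output is much flatter / near uniform
```

Because the sigmoid compresses all logits into (0, 1), the differences between them shrink before softmax ever sees them, which is exactly why the resulting distribution looks nearly uniform.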

By the way, if you choose not to use ReLU, tanh is generally a better hidden-layer activation function than sigmoid.