Hidden layers of a classifier network use sigmoid (or another activation function) to introduce non-linearity and squash activations into a bounded range, but does the last layer use sigmoid in conjunction with softmax?
I have a feeling it doesn't matter and the network will train either way, but should a softmax layer alone be used, or should the sigmoid function be applied first?
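For concreteness, here is a small numpy sketch of the two variants I'm asking about (the logit values are made up for illustration):

```python
import numpy as np

def softmax(z):
    # Subtract the max for numerical stability before exponentiating.
    e = np.exp(z - np.max(z))
    return e / e.sum()

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical raw outputs (logits) of the final linear layer.
logits = np.array([2.0, -1.0, 0.5])

# Variant 1: softmax applied directly to the raw logits.
p_softmax_only = softmax(logits)

# Variant 2: sigmoid first, then softmax. The sigmoid squashes every
# logit into (0, 1), so the inputs to softmax can differ by at most 1.
p_sigmoid_then_softmax = softmax(sigmoid(logits))

print(p_softmax_only)
print(p_sigmoid_then_softmax)
```

Both variants produce a valid probability distribution (non-negative, summing to 1), so I can't tell from the output alone which one is the right design.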