I want to understand in more detail what a softmax layer looks like in a CNN for semantic segmentation / pixelwise classification of an image. The CNN outputs an image of class labels, where each pixel of the original image gets a label.
After passing a test image through the network, the next-to-last layer outputs N channels at the resolution of the original image. My question is how the softmax layer transforms these N channels into the final image of labels.
Assume we have C classes (the number of possible labels). My suggestion is that, for each pixel, its N neurons in the previous layer are connected to C neurons in the softmax layer, where each of the C neurons represents one class. With the softmax activation function, the C outputs for this pixel sum to 1 (which facilitates training of the network). Finally, each pixel is assigned the class with the highest probability (given by the softmax values). This would mean that the softmax layer consists of C * #pixels neurons. Is my suggestion correct? I couldn't find an explanation for this and hope that you can help me.
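To make the suggestion concrete, here is a minimal NumPy sketch of what I have in mind. It is only an illustration under my own assumptions, not taken from any particular framework: the function name `pixelwise_softmax_labels`, the `weights` and `bias` arrays, and the small shapes are all made up, and I assume the same N-to-C weights are shared by every pixel (which would make the layer equivalent to a 1x1 convolution followed by a softmax over the class axis).

```python
import numpy as np

def pixelwise_softmax_labels(features, weights, bias):
    """Hypothetical helper illustrating the suggestion in the question.

    features: (H, W, N) output of the next-to-last layer
    weights:  (N, C) connections from N channels to C classes,
              assumed to be shared across all pixels
    bias:     (C,) one bias per class
    """
    # Per-pixel linear map from N channels to C class scores.
    scores = features @ weights + bias                  # shape (H, W, C)

    # Softmax over the class axis, computed separately for each pixel.
    # Subtracting the per-pixel maximum is for numerical stability only.
    scores = scores - scores.max(axis=-1, keepdims=True)
    exp_scores = np.exp(scores)
    probs = exp_scores / exp_scores.sum(axis=-1, keepdims=True)

    # Each pixel gets the class with the highest probability.
    labels = probs.argmax(axis=-1)                      # shape (H, W)
    return probs, labels

# Made-up sizes and random values, purely for illustration.
rng = np.random.default_rng(0)
H, W, N, C = 4, 5, 8, 3
features = rng.normal(size=(H, W, N))
weights = rng.normal(size=(N, C))
bias = np.zeros(C)

probs, labels = pixelwise_softmax_labels(features, weights, bias)
print(probs.shape)   # (4, 5, 3): C softmax outputs per pixel
print(labels.shape)  # (4, 5): one label per pixel
```

In this sketch the softmax output has C * H * W values, which matches the C * #pixels neurons from my suggestion, and the C values at each pixel sum to 1.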
Thanks for helping!