The main point is answered satisfactorily with the brilliant piece of sleuthing by desernaut. However there are occasions when BCE (binary cross entropy) could throw different results than CCE (categorical cross entropy) and may be the preferred choice. While the thumb rules shared above (which loss to select) work fine for 99% of the cases, I would like to add a few new dimensions to this discussion.
The OP had a softmax activation and this throws a probability distribution as the predicted value. It is a multi-class problem. The preferred loss is categorical CE. Essentially this boils down to -ln(p) where 'p' is the predicted probability of the lone positive class in the sample. This means that the negative predictions dont have a role to play in calculating CE. This is by intention.
On a rare occasion, it may be needed to make the -ve voices count. This can be done by treating the above sample as a series of binary predictions. So if expected is [1 0 0 0 0] and predicted is [0.1 0.5 0.1 0.1 0.2], this is further broken down into:
expected = [1,0], [0,1], [0,1], [0,1], [0,1]
predicted = [0.1, 0.9], [.5, .5], [.1, .9], [.1, .9], [.2, .8]
Now we proceed to compute 5 different cross entropies - one for each of the above 5 expected/predicted combo and sum them up. Then:
CE = -[ ln(.1) + ln(0.5) + ln(0.9) + ln(0.9) + ln(0.8)]
The CE has a different scale but continues to be a measure of the difference between the expected and predicted values. The only difference is that in this scheme, the -ve values are also penalized/rewarded along with the +ve values. In case your problem is such that you are going to use the output probabilities (both +ve and -ves) instead of using the max() to predict just the 1 +ve label, then you may want to consider this version of CE.
How about a multi-label situation where expected = [1 0 0 0 1]? Conventional approach is to use one sigmoid per output neuron instead of an overall softmax. This ensures that the output probabilities are independent of each other. So we get something like:
expected = [1 0 0 0 1]
predicted is = [0.1 0.5 0.1 0.1 0.9]
By definition, CE measures the difference between 2 probability distributions. But the above two lists are not probability distributions. Probability distributions should always add up to 1. So conventional solution is to use same loss approach as before - break the expected and predicted values into 5 individual probability distributions, proceed to calculate 5 cross entropies and sum them up. Then:
CE = -[ ln(.1) + ln(0.5) + ln(0.9) + ln(0.9) + ln(0.9)] = 3.3
The challenge happens when the number of classes may be very high - say a 1000 and there may be only couple of them present in each sample. So the expected is something like: [1,0,0,0,0,0,1,0,0,0.....990 zeroes]. The predicted could be something like: [.8, .1, .1, .1, .1, .1, .8, .1, .1, .1.....990 0.1's]
In this case the CE =
- [ ln(.8) + ln(.8) for the 2 +ve classes and 998 * ln(0.9) for the 998 -ve classes]
= 0.44 (for the +ve classes) + 105 (for the negative classes)
You can see how the -ve classes are beginning to create a nuisance value when calculating the loss. The voice of the +ve samples (which may be all that we care about) is getting drowned out. What do we do? We can't use categorical CE (the version where only +ve samples are considered in calculation). This is because, we are forced to break up the probability distributions into multiple binary probability distributions because otherwise it would not be a probability distribution in the first place. Once we break it into multiple binary probability distributions, we have no choice but to use binary CE and this of course gives weightage to -ve classes.
One option is to drown the voice of the -ve classes by a multiplier. So we multiply all -ve losses by a value gamma where gamma < 1. Say in above case, gamma can be .0001. Now the loss comes to:
= 0.44 (for the +ve classes) + 0.105 (for the negative classes)
The nuisance value has come down. 2 years back Facebook did that and much more in a paper they came up with where they also multiplied the -ve losses by p to the power of x. 'p' is the probability of the output being a +ve and x is a constant>1. This penalized -ve losses even further especially the ones where the model is pretty confident (where 1-p is close to 1). This combined effect of punishing negative class losses combined with harsher punishment for the easily classified cases (which accounted for majority of the -ve cases) worked beautifully for Facebook and they called it focal loss.
So in response to OP's question of whether binary CE makes any sense at all in his case, the answer is - it depends. In 99% of the cases the conventional thumb rules work but there could be occasions when these rules could be bent or even broken to suit the problem at hand.
For a more in-depth treatment, you can refer to: https://towardsdatascience.com/cross-entropy-classification-losses-no-math-few-stories-lots-of-intuition-d56f8c7f06b0
categorical_crossentropy
. Also labels need to converted into the categorical format. Seeto_categorical
to do this. Also see definitions of categorical and binary crossentropies here. – Autonomouscategorical_crossentropy
. If you have two classes, they will be represented as0, 1
in binary labels and10, 01
in categorical label format. – AutonomousDense(1, activation='softmax')
for binary classification is simply wrong. Remember softmax output is a probability distribution that sums to one. If you want to have only one output neuron with binary classification, use sigmoid with binary cross-entropy. – Autonomous