Why binary_crossentropy and categorical_crossentropy give different performances for the same problem?

Question

I'm trying to train a CNN to categorize text by topic. When I use binary cross-entropy I get ~80% accuracy, with categorical cross-entropy I get ~50% accuracy.

I don't understand why this is. It's a multiclass problem, doesn't that mean that I have to use categorical cross-entropy and that the results with binary cross-entropy are meaningless?

model.add(embedding_layer)
model.add(Dropout(0.25))
# convolution layers
model.add(Conv1D(nb_filter=32,
                    filter_length=4,
                    border_mode='valid',
                    activation='relu'))
model.add(MaxPooling1D(pool_length=2))
# dense layers
model.add(Flatten())
model.add(Dense(256))
model.add(Dropout(0.25))
model.add(Activation('relu'))
# output layer
model.add(Dense(len(class_id_index)))
model.add(Activation('softmax'))

Then I compile it either it like this using categorical_crossentropy as the loss function:

model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

or

model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

Intuitively it makes sense why I'd want to use categorical cross-entropy, I don't understand why I get good results with binary, and poor results with categorical.

If it is a multiclass problem, you have to use categorical_crossentropy. Also labels need to converted into the categorical format. See to_categorical to do this. Also see definitions of categorical and binary crossentropies here. — Autonomous
My labels are categorical, created using to_categorical (one hot vectors for each class). Does that mean the ~80% accuracy from binary crossentropy is just a bogus number? — Daniel Messias
I think so. If you use categorical labels i.e. one hot vectors, then you want categorical_crossentropy. If you have two classes, they will be represented as 0, 1 in binary labels and 10, 01 in categorical label format. — Autonomous
I think he just compares to the first number in the vector and ignores the rest. — Thomas Pinetz
@NilavBaranGhosh The representation will be [[1, 0], [0, 1]] for a categorical classification involving two classes (not [[0, 0], [0, 1]] like you mention). Dense(1, activation='softmax') for binary classification is simply wrong. Remember softmax output is a probability distribution that sums to one. If you want to have only one output neuron with binary classification, use sigmoid with binary cross-entropy. — Autonomous

desertnaut desertnaut · Accepted Answer · 2017-09-04T13:34:45

The reason for this apparent performance discrepancy between categorical & binary cross entropy is what user xtof54 has already reported in his answer below, i.e.:

the accuracy computed with the Keras method evaluate is just plain wrong when using binary_crossentropy with more than 2 labels

I would like to elaborate more on this, demonstrate the actual underlying issue, explain it, and offer a remedy.

This behavior is not a bug; the underlying reason is a rather subtle & undocumented issue at how Keras actually guesses which accuracy to use, depending on the loss function you have selected, when you include simply metrics=['accuracy'] in your model compilation. In other words, while your first compilation option

model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

is valid, your second one:

model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

will not produce what you expect, but the reason is not the use of binary cross entropy (which, at least in principle, is an absolutely valid loss function).

Why is that? If you check the metrics source code, Keras does not define a single accuracy metric, but several different ones, among them binary_accuracy and categorical_accuracy. What happens under the hood is that, since you have selected binary cross entropy as your loss function and have not specified a particular accuracy metric, Keras (wrongly...) infers that you are interested in the binary_accuracy, and this is what it returns - while in fact you are interested in the categorical_accuracy.

Let's verify that this is the case, using the MNIST CNN example in Keras, with the following modification:

model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])  # WRONG way

model.fit(x_train, y_train,
          batch_size=batch_size,
          epochs=2,  # only 2 epochs, for demonstration purposes
          verbose=1,
          validation_data=(x_test, y_test))

# Keras reported accuracy:
score = model.evaluate(x_test, y_test, verbose=0) 
score[1]
# 0.9975801164627075

# Actual accuracy calculated manually:
import numpy as np
y_pred = model.predict(x_test)
acc = sum([np.argmax(y_test[i])==np.argmax(y_pred[i]) for i in range(10000)])/10000
acc
# 0.98780000000000001

score[1]==acc
# False

To remedy this, i.e. to use indeed binary cross entropy as your loss function (as I said, nothing wrong with this, at least in principle) while still getting the categorical accuracy required by the problem at hand, you should ask explicitly for categorical_accuracy in the model compilation as follows:

from keras.metrics import categorical_accuracy
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=[categorical_accuracy])

In the MNIST example, after training, scoring, and predicting the test set as I show above, the two metrics now are the same, as they should be:

# Keras reported accuracy:
score = model.evaluate(x_test, y_test, verbose=0) 
score[1]
# 0.98580000000000001

# Actual accuracy calculated manually:
y_pred = model.predict(x_test)
acc = sum([np.argmax(y_test[i])==np.argmax(y_pred[i]) for i in range(10000)])/10000
acc
# 0.98580000000000001

score[1]==acc
# True

System setup:

Python version 3.5.3
Tensorflow version 1.2.1
Keras version 2.0.4

UPDATE: After my post, I discovered that this issue had already been identified in this answer.

Why binary_crossentropy and categorical_crossentropy give different performances for the same problem?

12 Answers