I'm experimenting with a model combining a convolutional neural network with a linear model. Here is a simplified version of it:
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Conv1D, GlobalAveragePooling1D, Dropout, Dense
from tensorflow.keras.experimental import WideDeepModel, LinearModel

num_classes = 1  # binary target (0='NO' or 1='YES')

cnn_model = Sequential()
cnn_model.add(Conv1D(20, 8, padding='same', activation='relu'))
cnn_model.add(GlobalAveragePooling1D())
cnn_model.add(Dropout(0.6))
cnn_model.add(Dense(num_classes, activation='sigmoid'))
linear_model = LinearModel()
combined_model = WideDeepModel(linear_model, cnn_model)
combined_model.compile(optimizer=['sgd', 'adam'],
                       loss=['mse', 'binary_crossentropy'],
                       metrics=['accuracy'])
Performance is very good and everything seemed to be going well until I sorted the predictions by pval. There are predictions >1, even though I'm using a sigmoid activation, which I thought was supposed to squash everything between 0 and 1. The linear model has no activation function (but its inputs are all scaled to 0-1):
pv = combined_model.predict([dplus_test, X_test])
pval = [a[0] for a in pv]
pred = [1 if a > threshold else 0 for a in pval]
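For reference, this is the sigmoid behavior I'm assuming: a quick pure-Python check (independent of Keras) that the logistic function keeps any finite input strictly between 0 and 1, up to floating-point limits:

```python
import math

def sigmoid(x):
    # Standard logistic function: 1 / (1 + e^(-x))
    return 1.0 / (1.0 + math.exp(-x))

# Even fairly extreme inputs stay strictly inside (0, 1)
for x in [-30.0, -1.0, 0.0, 1.0, 30.0]:
    assert 0.0 < sigmoid(x) < 1.0
```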
true pred pval dplus
1633 1 1 1.002850 15.22404
1326 1 1 1.001444 10.34983
1289 1 1 1.001368 10.03043
1371 1 1 1.000986 10.74037
1188 1 1 1.000707 8.902
I checked the other end of the data, and those predictions are as I expected: always >0.
true pred pval dplus
145 0 0 0.000463 1.81635
383 0 0 0.001023 3.24982
1053 0 0 0.001365 7.22535
This is not a problem so far; nothing crashes and I'm happy with the performance. But I would like to know whether my understanding of the sigmoid activation function is wrong, whether something in the combined model allows values to go above 1, and whether I can trust these results.
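One hypothesis I'm considering (an assumption about the combiner, not something I've confirmed in the Keras source): if WideDeepModel simply adds the linear submodel's output to the CNN's sigmoid output, the sum is no longer bounded by 1. A pure-NumPy sketch of that idea, with made-up numbers:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Hypothetical per-sample pre-activation values (made-up, for illustration)
dnn_out = sigmoid(np.array([4.0, 2.0, -3.0]))   # CNN head: each value in (0, 1)
linear_out = np.array([0.02, 0.001, -0.01])     # linear head: unbounded output

# If the combiner just sums the two heads with no final activation,
# the combined score can land slightly above 1 (or below 0)
combined = dnn_out + linear_out
print(combined)  # first entry slightly above 1
```

That would be consistent with the values I'm seeing, which only barely exceed 1 (e.g. 1.00285).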