2 votes

I have done manual hyperparameter optimization for ML models before and always defaulted to tanh or relu as hidden layer activation functions. Recently, I started trying out Keras Tuner to optimize my architecture and accidentally left softmax as a choice for hidden layer activation.
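The search space was set up along these lines (a simplified sketch rather than my exact code; the unit ranges and the activation list shown here are only illustrative):

import keras_tuner as kt
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense, Dropout, LayerNormalization

def build_model(hp):
    model = Sequential()
    model.add(Dense(hp.Int('units_1', 32, 1024, step=8),
                    # 'softmax' was accidentally left in this list of candidates
                    activation=hp.Choice('act_1', ['relu', 'tanh', 'softsign', 'softmax']),
                    input_shape=(train_x.shape[1],)))
    model.add(Dropout(hp.Float('drop_1', 0.0, 0.5)))
    model.add(LayerNormalization())
    model.add(Dense(1, activation='linear'))
    model.compile(loss='mse', optimizer='Adam')
    return model

tuner = kt.RandomSearch(build_model, objective='val_loss', max_trials=50)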

I have only ever seen softmax used in the output layer of classification models, never as a hidden layer activation, and especially not for regression. This model performs really well at predicting temperature, but I am having a tough time justifying the choice.

I have seen posts like this one that discuss why softmax should only be used in the output layer, but is there any justification in my case? I am showing the overall architecture below, for reference.

from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense, Dropout, LayerNormalization

model = Sequential()
model.add(Dense(648, activation='relu', input_shape=(train_x.shape[1],)))
model.add(Dropout(0.3))
model.add(LayerNormalization())
model.add(Dense(152, activation='relu'))
model.add(Dropout(0.15))
model.add(LayerNormalization())
model.add(Dense(924, activation='softsign'))
model.add(Dropout(0.37))
model.add(LayerNormalization())
model.add(Dense(248, activation='softmax'))   # softmax in a hidden layer
model.add(Dropout(0.12))
model.add(LayerNormalization())
model.add(Dense(1, activation='linear'))
model.compile(loss='mse', optimizer='Adam')
What is the size of your dataset? – Frightera
Well, I have 6 inputs and 1 output. The number of training samples used for the tuner was 100,000, but I have over 80 million in total. – WVJoe

1 Answer

4 votes

I could be wrong, but mathematically it should not matter whether the task is classification or regression.

Generally speaking, softmax is not preferred in hidden layers because we want each neuron to be independent of the others. If you apply softmax, the activations become dependent: the function forces them to sum to one, so increasing one output necessarily decreases the rest. That does not mean it is never used; you can refer to this paper.
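For concreteness, here is a quick numerical check (not part of the original answer) showing that softmax outputs always sum to one per sample, so the hidden units cannot vary independently:

import numpy as np
import tensorflow as tf

# Pretend these are the pre-activations of a 248-unit hidden layer for 4 samples.
logits = np.random.randn(4, 248).astype('float32')
activations = tf.nn.softmax(logits, axis=-1).numpy()

print(activations.sum(axis=-1))   # -> approximately [1. 1. 1. 1.]
# Because of this constraint, pushing one unit up necessarily pushes the others down.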

Compare this with an advanced activation such as LeakyReLU: the neurons stay under control because the negative slope (alpha) can be tuned, and each neuron's output still depends only on its own pre-activation. With softmax that is not possible; a sketch is shown below.
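As a hedged sketch (the layer sizes here are only illustrative, chosen to mirror the question's 248-unit layer and 6 inputs), this is what swapping a softmax hidden layer for LeakyReLU with a tunable alpha could look like:

from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense, LeakyReLU

model = Sequential()
model.add(Dense(248, input_shape=(6,)))   # linear Dense; the activation is added as its own layer
model.add(LeakyReLU(alpha=0.1))           # alpha (the negative slope) is a tunable hyperparameter
model.add(Dense(1, activation='linear'))
model.compile(loss='mse', optimizer='adam')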

Now, back to the question: I think this is dataset dependent. Your model is able to generalize on this dataset with softmax, but I don't think it will always work that way. As mentioned above, you are making the neurons dependent on each other, so if one neuron learns something wrong, it will affect the generalization of the whole network, because the other outputs will be affected as well.

Edit: I tested two models. On some data, softmax worked as well as relu, but in every case all the neurons are tied to each other. Making them dependent on each other is not a risk worth taking, especially in large networks.

Data:

import numpy as np

# Purely random data: the targets are independent of the inputs.
X_train = np.random.randn(10000, 20)
y_train = np.random.randn(10000, 1)
X_test = np.random.randn(5000, 20)
y_test = np.random.randn(5000, 1)

With Softmax:

from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense

model = Sequential()
model.add(Dense(512, activation='relu', input_shape=(20,)))
model.add(Dense(256, activation='softmax'))   # softmax in every hidden layer below
model.add(Dense(512, activation='softmax'))
model.add(Dense(256, activation='softmax'))
model.add(Dense(128, activation='softmax'))
model.add(Dense(1, activation='linear'))
model.compile(loss='mse', optimizer='adam')
model.fit(X_train, y_train, epochs=16, validation_data=(X_test, y_test))

Result: The model was not able to learn this data. The loss plateaued around 1.0 and stopped improving, as if one neuron wants to learn but the others, coupled to it through the softmax, do not let it.

Epoch 15/16
313/313 [==============================] - 1s 3ms/step - loss: 1.0259 - val_loss: 1.0269
Epoch 16/16
313/313 [==============================] - 1s 3ms/step - loss: 1.0020 - val_loss: 1.0271

With relu:

model = Sequential()
model.add(Dense(512, activation='relu', input_shape=(20,)))
model.add(Dense(256, activation='relu'))
model.add(Dense(512, activation='relu'))
model.add(Dense(256, activation='relu'))
model.add(Dense(128, activation='relu'))
model.add(Dense(1, activation='linear'))
model.compile(loss='mse', optimizer='adam')
model.fit(X_train, y_train, epochs=16, validation_data=(X_test, y_test))

# Clearly overfitting, but that is not the point here.

Result: The model with relu was able to learn the data in both cases.

Epoch 15/16
313/313 [==============================] - 1s 3ms/step - loss: 0.5580 - val_loss: 1.3091
Epoch 16/16
313/313 [==============================] - 1s 3ms/step - loss: 0.4808 - val_loss: 1.3290