Neural net fails on toy dataset

votes

I have created the following toy dataset:

I am trying to predict the class with a neural net in keras:

model = Sequential()
model.add(Dense(units=2, activation='sigmoid', input_shape= (nr_feats,)))
model.add(Dense(units=nr_classes, activation='softmax'))

model.compile(loss='categorical_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])

With nr_feats and nr_classes set to 2. The neural net can only predict with 50 percent accuracy returning either all 1's or all 2's. Using Logistic Regression results in 100 percent accuracy.

I can not find what is going wrong here.

I have uploaded a notebook to github if you quickly want to try something.

EDIT 1

I drastically increased the number of epochs and accuracy finally starts to improve from 0.5 at epoch 72 and converges to 1.0 at epoch 98. This still seems extremely slow for such a simple dataset.

I am aware it is better to use a single output neuron with sigmoid activation but it's more that I want to understand why it does not work with two output neurons and softmax activation.

I pre-process my dataframe as follows:

from sklearn.preprocessing import LabelEncoder

x_train = df_train.iloc[:,0:-1].values
y_train = df_train.iloc[:, -1]

nr_feats = x_train.shape[1]
nr_classes = y_train.nunique()

label_enc = LabelEncoder()
label_enc.fit(y_train)

y_train = keras.utils.to_categorical(label_enc.transform(y_train), nr_classes)

Training and evaluation:

model.fit(x_train, y_train, epochs=500, batch_size=32, verbose=True)
accuracy_score(model.predict_classes(x_train),  df_train.iloc[:, -1].values)

EDIT 2

After changing the output layer to a single neuron with sigmoid activation and using binary_crossentropy loss as modesitt suggested, accuracy still remains at 0.5 for 200 epochs and converges to 1.0 100 epochs later.

pythonmachine-learningneural-networkkerasclassification

Have you tried increasing the number of epochs, for example to 20 or 50 or 100? Since your keras model is very small (i.e. has very few parameters), I suspect it takes more epochs to see any increase in accuracy, since the loss is decreasing so there is some progress happening. – today

@today the loss is decreasing as it converges to predicting all 1 class given my answer. This gives the cross-entropy loss for single class guessing. No neural network will take any substantial time for converging on an example this simple. – modesitt

@today Thanks for your help. I had increased the epochs a bit before, I now increased them further and it finally starts to improve from 0.5 at epoch 72 and converges to 1.0 at epoch 98. This seems extremely slow for such a simple dataset. – Anton

@modesitt I didn't notice that the labels are 1 and 2. You are right. It should be encoded as 0 and 1. And as you mentioned it is better to use a dense layer with one unit and an activation function of sigmoid for the last layer. But I still suspect that, after fixing these issues, you can't get an accuracy of higher than 60% considering such a small net and (relatively large amount of) random input data. It takes more than 5 epochs in my opinion. – today

@modesitt No! Actually you are wrong about labels. They are already encoded as one hot vectors so there is no problem in that regard. – today

2 Answers

votes

The issue is that your labels are 1 and 2 instead of 0 and 1. Keras will not raise an error when it sees 2, but it is not capable of predicting 2.

Subtract 1 from all your y values. As a side note, it is common in deep learning to use 1 neuron with sigmoid for binary classification (0 or 1) vs 2 classes with softmax. Finally, use binary_crossentropy for the loss for binary classification problems.

votes

Note: Read the "Update" section at the end of my answer if you want the true reason. In this scenario, the other two reasons I have mentioned are only valid when the learning rate is set to a low value (less than 1e-3).

I put together some code. It is very similar to yours but I just cleaned it a little bit and made it simpler for myself. As you can see, I use a dense layer with one unit with a sigmoid activation function for the last layer and just change the optimizer from adam to rmsprop (it is not important that much, you can use adam if you like):

import numpy as np
import random

# generate random data with two features
n_samples = 200
n_feats = 2

cls0 = np.random.uniform(low=0.2, high=0.4, size=(n_samples,n_feats))
cls1 = np.random.uniform(low=0.5, high=0.7, size=(n_samples,n_feats))
x_train = np.concatenate((cls0, cls1))
y_train = np.concatenate((np.zeros((n_samples,)), np.ones((n_samples,))))

# shuffle data because all negatives (i.e. class "0") are first
# and then all positives (i.e. class "1")
indices = np.arange(x_train.shape[0])
np.random.shuffle(indices)
x_train = x_train[indices]
y_train = y_train[indices]

from keras.models import Sequential
from keras.layers import Dense


model = Sequential()
model.add(Dense(2, activation='sigmoid', input_shape=(n_feats,)))
model.add(Dense(1, activation='sigmoid'))

model.compile(loss='binary_crossentropy',
              optimizer='rmsprop',
              metrics=['accuracy'])

model.summary()

model.fit(x_train, y_train, epochs=5, batch_size=32, verbose=True)

Here is the output:

Layer (type)                 Output Shape              Param #   
=================================================================
dense_25 (Dense)             (None, 2)                 6         
_________________________________________________________________
dense_26 (Dense)             (None, 1)                 3         
=================================================================
Total params: 9
Trainable params: 9
Non-trainable params: 0
_________________________________________________________________
Epoch 1/5
400/400 [==============================] - 0s 966us/step - loss: 0.7013 - acc: 0.5000
Epoch 2/5
400/400 [==============================] - 0s 143us/step - loss: 0.6998 - acc: 0.5000
Epoch 3/5
400/400 [==============================] - 0s 137us/step - loss: 0.6986 - acc: 0.5000
Epoch 4/5
400/400 [==============================] - 0s 149us/step - loss: 0.6975 - acc: 0.5000
Epoch 5/5
400/400 [==============================] - 0s 132us/step - loss: 0.6966 - acc: 0.5000

As you can see the accuracy never increases from 50%. What if you increase the number of epochs to say 50:

Layer (type)                 Output Shape              Param #   
=================================================================
dense_35 (Dense)             (None, 2)                 6         
_________________________________________________________________
dense_36 (Dense)             (None, 1)                 3         
=================================================================
Total params: 9
Trainable params: 9
Non-trainable params: 0
_________________________________________________________________
Epoch 1/50
400/400 [==============================] - 0s 1ms/step - loss: 0.6925 - acc: 0.5000
Epoch 2/50
400/400 [==============================] - 0s 136us/step - loss: 0.6902 - acc: 0.5000
Epoch 3/50
400/400 [==============================] - 0s 133us/step - loss: 0.6884 - acc: 0.5000
Epoch 4/50
400/400 [==============================] - 0s 160us/step - loss: 0.6866 - acc: 0.5000
Epoch 5/50
400/400 [==============================] - 0s 140us/step - loss: 0.6848 - acc: 0.5000
Epoch 6/50
400/400 [==============================] - 0s 168us/step - loss: 0.6832 - acc: 0.5000
Epoch 7/50
400/400 [==============================] - 0s 154us/step - loss: 0.6817 - acc: 0.5000
Epoch 8/50
400/400 [==============================] - 0s 146us/step - loss: 0.6802 - acc: 0.5000
Epoch 9/50
400/400 [==============================] - 0s 161us/step - loss: 0.6789 - acc: 0.5000
Epoch 10/50
400/400 [==============================] - 0s 140us/step - loss: 0.6778 - acc: 0.5000
Epoch 11/50
400/400 [==============================] - 0s 177us/step - loss: 0.6766 - acc: 0.5000
Epoch 12/50
400/400 [==============================] - 0s 180us/step - loss: 0.6755 - acc: 0.5000
Epoch 13/50
400/400 [==============================] - 0s 165us/step - loss: 0.6746 - acc: 0.5000
Epoch 14/50
400/400 [==============================] - 0s 128us/step - loss: 0.6736 - acc: 0.5000
Epoch 15/50
400/400 [==============================] - 0s 125us/step - loss: 0.6728 - acc: 0.5000
Epoch 16/50
400/400 [==============================] - 0s 165us/step - loss: 0.6718 - acc: 0.5000
Epoch 17/50
400/400 [==============================] - 0s 161us/step - loss: 0.6710 - acc: 0.5000
Epoch 18/50
400/400 [==============================] - 0s 170us/step - loss: 0.6702 - acc: 0.5000
Epoch 19/50
400/400 [==============================] - 0s 122us/step - loss: 0.6694 - acc: 0.5000
Epoch 20/50
400/400 [==============================] - 0s 110us/step - loss: 0.6686 - acc: 0.5000
Epoch 21/50
400/400 [==============================] - 0s 142us/step - loss: 0.6676 - acc: 0.5000
Epoch 22/50
400/400 [==============================] - 0s 142us/step - loss: 0.6667 - acc: 0.5000
Epoch 23/50
400/400 [==============================] - 0s 149us/step - loss: 0.6659 - acc: 0.5000
Epoch 24/50
400/400 [==============================] - 0s 125us/step - loss: 0.6651 - acc: 0.5000
Epoch 25/50
400/400 [==============================] - 0s 134us/step - loss: 0.6643 - acc: 0.5000
Epoch 26/50
400/400 [==============================] - 0s 143us/step - loss: 0.6634 - acc: 0.5000
Epoch 27/50
400/400 [==============================] - 0s 137us/step - loss: 0.6625 - acc: 0.5000
Epoch 28/50
400/400 [==============================] - 0s 131us/step - loss: 0.6616 - acc: 0.5025
Epoch 29/50
400/400 [==============================] - 0s 119us/step - loss: 0.6608 - acc: 0.5100
Epoch 30/50
400/400 [==============================] - 0s 143us/step - loss: 0.6601 - acc: 0.5025
Epoch 31/50
400/400 [==============================] - 0s 148us/step - loss: 0.6593 - acc: 0.5350
Epoch 32/50
400/400 [==============================] - 0s 161us/step - loss: 0.6584 - acc: 0.5325
Epoch 33/50
400/400 [==============================] - 0s 152us/step - loss: 0.6576 - acc: 0.5700
Epoch 34/50
400/400 [==============================] - 0s 128us/step - loss: 0.6568 - acc: 0.5850
Epoch 35/50
400/400 [==============================] - 0s 155us/step - loss: 0.6560 - acc: 0.5975
Epoch 36/50
400/400 [==============================] - 0s 136us/step - loss: 0.6552 - acc: 0.6425
Epoch 37/50
400/400 [==============================] - 0s 140us/step - loss: 0.6544 - acc: 0.6150
Epoch 38/50
400/400 [==============================] - 0s 120us/step - loss: 0.6538 - acc: 0.6375
Epoch 39/50
400/400 [==============================] - 0s 140us/step - loss: 0.6531 - acc: 0.6725
Epoch 40/50
400/400 [==============================] - 0s 135us/step - loss: 0.6523 - acc: 0.6750
Epoch 41/50
400/400 [==============================] - 0s 136us/step - loss: 0.6515 - acc: 0.7300
Epoch 42/50
400/400 [==============================] - 0s 126us/step - loss: 0.6505 - acc: 0.7450
Epoch 43/50
400/400 [==============================] - 0s 141us/step - loss: 0.6496 - acc: 0.7425
Epoch 44/50
400/400 [==============================] - 0s 162us/step - loss: 0.6489 - acc: 0.7675
Epoch 45/50
400/400 [==============================] - 0s 161us/step - loss: 0.6480 - acc: 0.7775
Epoch 46/50
400/400 [==============================] - 0s 126us/step - loss: 0.6473 - acc: 0.7575
Epoch 47/50
400/400 [==============================] - 0s 124us/step - loss: 0.6464 - acc: 0.7625
Epoch 48/50
400/400 [==============================] - 0s 130us/step - loss: 0.6455 - acc: 0.7950
Epoch 49/50
400/400 [==============================] - 0s 191us/step - loss: 0.6445 - acc: 0.8100
Epoch 50/50
400/400 [==============================] - 0s 163us/step - loss: 0.6435 - acc: 0.8625

The accuracy starts to increase (Note that if you train this model multiple times, each time it may take different number of epochs to reach an acceptable accuracy, anything from 10 to 100 epochs).

Also, in my experiments I noticed that increasing the number of units in the first dense layer, for example to 5 or 10 units, causes the model to be trained faster (i.e. quickly converge).

Why so many epochs needed?

I think it is because of these two reasons (combined):

1) Despite the fact that the two classes are easily separable, your data is made up of random samples, and

2) The number of data points compared to the size of neural net (i.e. number of trainable parameters, which is 9 in example code above) is relatively large.

Therefore, it takes more epochs for the model to learn the weights. It is as though the model is very restricted and needs more and more experience to correctly find the appropriate weights. As an evidence, just try to increase the number of units in the first dense layer. You are almost guaranteed to reach an accuracy of +90% with less than 10 epochs each time you attempt to train this model. Here you increase the capacity and therefore the model converges (i.e. trains) much faster (it should be noted that it starts to overfit if the capacity is too high or you train the model for too many epochs. You should have a validation scheme to monitor this issue).

Side note:

Don't set the high argument to a number less than the low argument in numpy.random.uniform since, according to the documentation, the results will be "officially undefined" in this case.

Update:

One more important thing here (maybe the most important thing in this scenario) is the learning rate of the optimizer. If the learning rate is too low, the model converges slowly. Try increasing the learning rate, and you can see you reach an accuracy of 100% with less than 5 epochs:

from keras import optimizers

model.compile(loss='binary_crossentropy',
              optimizer=optimizers.RMSprop(lr=1e-1),
              metrics=['accuracy'])

# or you may use adam
model.compile(loss='binary_crossentropy',
              optimizer=optimizers.Adam(lr=1e-1),
              metrics=['accuracy'])