I've been trying to train my model for the past few days, but no matter what I try, I run into the same issue. Training accuracy starts low and climbs above 90% within the first epoch, but validation at the end of each epoch comes out anywhere between 20-50%, and when I test the model its predictions are accurate for some classes but completely wrong for most. My dataset has 20,000 images, 2,000 per class, plus 100 testing images (I can get more if necessary). I would greatly appreciate any input, as I am quite new to machine learning as a whole and don't completely understand everything that goes into this.
I've looked at several posts and articles online describing similar issues and their fixes: defining activations as their own layers rather than as layer parameters, adding batch normalization layers and changing their momentum, trying several different optimizers and learning rates, using datasets of different sizes, using a custom initializer, and even completely changing the structure of my model. Nothing works.
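For example, one of the batch-norm variants I tried looked roughly like this (just a sketch of the pattern; the 0.9 momentum is one of several values I experimented with, and Keras' default is 0.99):

from keras.layers import Conv2D, BatchNormalization, Activation

# Activation defined as its own layer instead of a Conv2D parameter,
# plus a non-default BatchNormalization momentum (0.9 is an example)
model.add(Conv2D(64, (3, 3), padding='same', use_bias=False))
model.add(BatchNormalization(momentum=0.9))
model.add(Activation('relu'))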
Here is the main part of the network:
import keras
from keras.models import Sequential
from keras.layers import (Conv2D, MaxPooling2D, BatchNormalization,
                          Activation, Dropout, Flatten, Dense)

model = Sequential()
initializer = keras.initializers.he_normal(seed=None)
# Conv blocks are Conv -> BatchNorm -> ReLU; use_bias=False because
# BatchNormalization supplies its own shift parameter
model.add(Conv2D(64, (3, 3), padding='same', use_bias=False, kernel_initializer=initializer, input_shape=x_train.shape[1:]))
model.add(BatchNormalization())
model.add(Activation('relu'))
model.add(Conv2D(64, (3, 3), padding='same', use_bias=False, kernel_initializer=initializer))
model.add(BatchNormalization())
model.add(Activation('relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Conv2D(128, (3, 3), padding='same', use_bias=False, kernel_initializer=initializer))
model.add(BatchNormalization())
model.add(Activation('relu'))
model.add(Conv2D(128, (3, 3), padding='same', use_bias=False, kernel_initializer=initializer))
model.add(BatchNormalization())
model.add(Activation('relu'))
model.add(Dropout(0.2))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Conv2D(256, (3, 3), padding='same', use_bias=False, kernel_initializer=initializer))
model.add(BatchNormalization())
model.add(Activation('relu'))
model.add(Conv2D(256, (3, 3), padding='same', use_bias=False, kernel_initializer=initializer))
model.add(BatchNormalization())
model.add(Activation('relu'))
model.add(Dropout(0.2))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Flatten())
# Fully connected head
model.add(Dense(2048, use_bias=False, kernel_initializer=initializer))
model.add(BatchNormalization())
model.add(Activation('relu'))
model.add(Dropout(0.4))
# Output block: Dense -> BatchNorm -> softmax over num_classes classes
model.add(Dense(num_classes, use_bias=False))
model.add(BatchNormalization())
model.add(Activation('softmax'))
model.compile(loss='categorical_crossentropy', optimizer=keras.optimizers.Adam(lr=0.00005), metrics=['accuracy'])
# train the model
if not testing_mode:
    model.fit(x_train, y_train, epochs=epochs, batch_size=batch_size,
              shuffle=True, validation_data=(x_test, y_test))
scores = model.evaluate(x_test, y_test, verbose=0)
print('Test loss:', scores[0])
print('Test accuracy:', scores[1])
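To see which classes the predictions go wrong on, I've been checking them with a quick confusion-matrix script along these lines (a sketch; it assumes y_test is one-hot encoded and that scikit-learn is installed):

import numpy as np
from sklearn.metrics import confusion_matrix

# Convert softmax outputs and one-hot labels to class indices
y_pred = np.argmax(model.predict(x_test), axis=1)
y_true = np.argmax(y_test, axis=1)

# Rows are true classes, columns are predicted classes
print(confusion_matrix(y_true, y_pred))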
Here are the last few batches of an epoch and the validation result at the end:
19776/20000 [============================>.] - ETA: 25s - loss: 0.4859 - acc: 0.9707
19840/20000 [============================>.] - ETA: 18s - loss: 0.4855 - acc: 0.9708
19904/20000 [============================>.] - ETA: 11s - loss: 0.4851 - acc: 0.9709
19968/20000 [============================>.] - ETA: 3s - loss: 0.4848 - acc: 0.9710
20000/20000 [==============================] - 2323s 116ms/step - loss: 0.4848 - acc: 0.9710 - val_loss: 1.9185 - val_acc: 0.5000
Edit: I've been told to add more info about my dataset. I am training on this dataset with 10 classes of different hand gestures. Each image is preprocessed to 128x128 grayscale, and my 100-image testing set is 10 images taken from each class of the training set. I know it's better to use test data that's separate from the training set, but I wasn't sure whether removing images from the training set was a good idea. This is also one of the reasons I find this issue strange: if the model is overfitting to the training data, why is accuracy so low on data it has already seen? Let me know if you need any more information.
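In case it's relevant, this is roughly how I'd carve out a proper held-out test set instead of reusing training images (a sketch using scikit-learn's train_test_split; x_all/y_all stand in for the full image and label arrays, and the 10% split size is just an example):

from sklearn.model_selection import train_test_split

# Hold out 10% of the data as a test set the model never trains on,
# stratified so all 10 classes stay balanced (x_all/y_all are the
# full image and one-hot label arrays)
x_train, x_test, y_train, y_test = train_test_split(
    x_all, y_all, test_size=0.1, stratify=y_all.argmax(axis=1),
    random_state=42)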