I have a very basic multiclass CNN model for classifying vehicles into 4 classes [pickup, sedan, suv, van] that I have written using TensorFlow 2.0's tf.keras:
import tensorflow as tf

# Images are planar / channels-first, shape (3, 128, 128)
cfg_data_fmt = 'channels_first'
he_initialiser = tf.keras.initializers.VarianceScaling()

model = tf.keras.Sequential()
# Block 1: two 32-filter 3x3 convs + 2x2 max pool
model.add(tf.keras.layers.Conv2D(32, kernel_size=(3,3), input_shape=(3,128,128), activation='relu', padding='same', data_format=cfg_data_fmt, kernel_initializer=he_initialiser))
model.add(tf.keras.layers.Conv2D(32, kernel_size=(3,3), activation='relu', padding='same', data_format=cfg_data_fmt, kernel_initializer=he_initialiser))
model.add(tf.keras.layers.MaxPooling2D((2, 2), data_format=cfg_data_fmt))
# Block 2: two 64-filter 3x3 convs + 2x2 max pool
model.add(tf.keras.layers.Conv2D(64, kernel_size=(3,3), activation='relu', padding='same', data_format=cfg_data_fmt, kernel_initializer=he_initialiser))
model.add(tf.keras.layers.Conv2D(64, kernel_size=(3,3), activation='relu', padding='same', data_format=cfg_data_fmt, kernel_initializer=he_initialiser))
model.add(tf.keras.layers.MaxPooling2D((2, 2), data_format=cfg_data_fmt))
# Block 3: two 128-filter 3x3 convs + 2x2 max pool
model.add(tf.keras.layers.Conv2D(128, kernel_size=(3,3), activation='relu', padding='same', data_format=cfg_data_fmt, kernel_initializer=he_initialiser))
model.add(tf.keras.layers.Conv2D(128, kernel_size=(3,3), activation='relu', padding='same', data_format=cfg_data_fmt, kernel_initializer=he_initialiser))
model.add(tf.keras.layers.MaxPooling2D((2, 2), data_format=cfg_data_fmt))
# Classifier head: flatten, two 128-unit dense layers, 4-way softmax output
model.add(tf.keras.layers.Flatten(data_format=cfg_data_fmt))
model.add(tf.keras.layers.Dense(128, activation='relu', kernel_initializer=he_initialiser))
model.add(tf.keras.layers.Dense(128, activation='relu', kernel_initializer=he_initialiser))
model.add(tf.keras.layers.Dense(4, activation='softmax', kernel_initializer=he_initialiser))
I use the following configuration for training (a sketch of the corresponding compile call follows the list):
- Image size: 3x128x128 (planar data)
- Number of epochs: 45
- Batch size: 32
- Loss function: tf.keras.losses.CategoricalCrossentropy(from_logits=True)
- Optimizer: tf.optimizers.Adam
- training data size: 67.5% of all data
- validation data size: 12.5% of all data
- test data size: 20% of all data
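For reference, this is roughly how the model is compiled with the settings above. The per-class precision metrics are my assumption of how those values would typically be tracked (via the class_id argument of tf.keras.metrics.Precision); the exact metric setup in my code may differ slightly:

# Minimal compile sketch matching the configuration above
# (per-class precision via class_id is an assumption, not the exact original code)
model.compile(
    optimizer=tf.optimizers.Adam(),
    loss=tf.keras.losses.CategoricalCrossentropy(from_logits=True),
    metrics=['accuracy'] + [tf.keras.metrics.Precision(class_id=i, name='precision_' + name)
                            for i, name in enumerate(['pickup', 'sedan', 'suv', 'van'])]
)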
I have an unbalanced dataset, which has the following distribution:
pickups: 1202
sedans: 1954
suvs: 2510
vans: 196
For this reason I have used class weights to mitigate this imbalance (see the sketch after the list for how they are derived):
pickup_weight: 4.87
sedan_weight: 3.0
suv_weight: 2.33
van_weight: 30.0
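These weights are consistent with weighting each class by total_samples / class_count; a minimal sketch of that computation (variable names are mine, not from my actual code):

counts = {'pickup': 1202, 'sedan': 1954, 'suv': 2510, 'van': 196}
total = sum(counts.values())  # 5862
# Class index order matches the model output: [pickup, sedan, suv, van]
class_weight = {i: total / counts[name] for i, name in enumerate(['pickup', 'sedan', 'suv', 'van'])}
# -> {0: ~4.88, 1: 3.0, 2: ~2.34, 3: ~29.9}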
This seems like a small dataset, but I am using it for fine-tuning: I first train the model on a larger dataset of 16k images of the same classes, although those images show the vehicles from different angles than the images in my fine-tuning dataset. The fine-tuning step itself is sketched below.
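Roughly, the fine-tuning step looks like this. The weight file path and the x/y arrays are placeholders for my actual data pipeline, and the class_weight dict is the one from the sketch above, so treat this as a sketch of the workflow rather than the exact code:

# Reuse the weights learned on the 16k-image dataset, then fine-tune
# on the smaller dataset with class weighting (paths/arrays are placeholders)
model.load_weights('pretrained_16k_weights.h5')
history = model.fit(
    x_train, y_train,                # fine-tune images and one-hot labels
    batch_size=32,
    epochs=45,
    validation_data=(x_val, y_val),
    class_weight=class_weight,
)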
Now the questions that I'm having stem from the following observations:
At the end of the final epoch, the results returned by model.fit gave:
- training accuracy of 0.9229
- training loss of 3.5055
- validation accuracy of 0.7906
- validation loss of 0.9382
- training precision for class pickup of 0.9186
- training precision for class sedan of 0.9384
- training precision for class suv of 0.9196
- training precision for class van of 0.8378
- validation precision for class pickup of 0.7805
- validation precision for class sedan of 0.8026
- validation precision for class suv of 0.8029
- validation precision for class van of 0.4615
The results returned by model.evaluate on my hold-out test set after training gave accuracy and loss values similar to the corresponding validation values from the last epoch, and the per-class precision values were also nearly identical to the corresponding validation precisions.
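For reference, the evaluation call is essentially the following, assuming labeled_ds_test is the hold-out test set as an unbatched tf.data.Dataset of (image, one-hot label) pairs (which is also what the confusion-matrix code further down assumes):

# Evaluate on the held-out test set; returns the loss plus the compiled metrics
test_results = model.evaluate(labeled_ds_test.batch(32), verbose=1)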
The validation accuracy is lower than the training accuracy, but still high enough that I don't believe there is an overfitting problem; the model seems to generalize.
My first question is how can the validation loss be so much lower than the training loss?
Furthermore, when I created a confusion matrix using:
import numpy as np

# Materialise the test set (images and one-hot labels) as NumPy arrays,
# run predictions, and build the confusion matrix from the argmax class indices
test_images = np.array([x[0].numpy() for x in list(labeled_ds_test)])
test_labels = np.array([x[1].numpy() for x in list(labeled_ds_test)])
test_predictions = model.predict(test_images, batch_size=32)
print(tf.math.confusion_matrix(tf.argmax(test_labels, 1), tf.argmax(test_predictions, 1)))
The results I got back were:
tf.Tensor(
[[ 42  85 109   3]
 [ 72 137 177   4]
 [ 91 171 228  11]
 [  9  12  16   1]], shape=(4, 4), dtype=int32)
This shows an accuracy of only about 35% (408 correct predictions on the diagonal out of 1168 test samples)!!
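That 35% figure is just the diagonal of the matrix divided by the total; a quick sketch of the check (variable names are mine):

cm = np.array([[ 42,  85, 109,   3],
               [ 72, 137, 177,   4],
               [ 91, 171, 228,  11],
               [  9,  12,  16,   1]])
# Correct predictions lie on the diagonal: 408 of 1168 samples ~= 0.349
accuracy_from_cm = np.trace(cm) / cm.sum()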
My second question is therefore this: how can the accuracy derived from model.predict be so low, when during training and evaluation the metrics seemed to indicate that my model was quite precise in its predictions?
Am I using the predict method wrong, or is my theoretical understanding of what should happen completely off?
I am at a bit of a loss here and would greatly appreciate any feedback. Thanks for reading this.