
I'm a bit new when it comes to CNN, so please correct me wherever possible!

I've been experimenting with the MNIST dataset for digit classification. I decided to take it one step further by passing my own handwritten digit into the model's predict method. I am aware that the model only allows a fixed input resolution, so after some research, I used GlobalMaxPooling2D. This solved the problem of variable input image resolution. The problem I am facing right now is that the predict method accurately classifies images from the MNIST test set, but fails on my own handwritten digits. This is my model:

from tensorflow.keras.layers import (BatchNormalization, Conv2D, Dense,
                                     GlobalMaxPooling2D)
from tensorflow.keras.models import Sequential

model = Sequential()
model.add(Conv2D(128, (5, 5), input_shape=(None, None, 1), data_format='channels_last'))
model.add(Dense(80, activation='relu'))
model.add(GlobalMaxPooling2D())
model.add(BatchNormalization())
model.add(Dense(10, activation='softmax'))

The model gives a training accuracy of 94.98% and a test accuracy of 94.52%. For predicting my own handwritten digit, I used an image of resolution 200x200. The model can somehow predict specific digits like 8, 6 and 1, but when I test any other digit, it still classifies it as 8, 6 or 1. Can anyone please point out where I'm going wrong? Any help is appreciated!

I am not sure how your code trains. Going from Conv2D to Dense without transforming the 2D spatial information into 1D for Dense will throw an inconsistent-dimension error. Did you mean to swap the GlobalMaxPooling2D with the Dense? - rayryeng
Make sure that when you input your handwritten images, you do EXACTLY the same pre-processing that you did for the training images, for example rescaling, resizing etc. - Gerry P
@rayryeng, thanks, I added a Flatten() layer after GlobalMaxPooling2D and it's giving better predictions on my handwritten test set. I'm not quite sure why I did not face an inconsistent-dimension error in the previous model, though. - Advait Shirvaikar
@GerryP, yes, I did that. It still didn't improve its ability to identify them. - Advait Shirvaikar
I still don't see how adding a flatten after the pooling as you've done here gives you better accuracy. It doesn't make sense to go from conv to dense unless you consolidate the spatial information in a way that is compatible with the dense layers. It mathematically makes no sense to me. - rayryeng
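On the pre-processing point Gerry P raises, the key is to reproduce the training pipeline exactly. A minimal sketch of what matching MNIST conventions might look like (the file path and function name are hypothetical; assumes Pillow and NumPy are installed, and that training images were scaled to [0, 1]):

```python
import numpy as np
from PIL import Image

def preprocess_digit(path):
    """Load a hand-drawn digit and format it like an MNIST sample."""
    img = Image.open(path).convert('L')   # grayscale, like MNIST
    img = img.resize((28, 28))            # MNIST resolution
    arr = np.asarray(img, dtype='float32')
    # MNIST digits are white strokes on a black background; invert if
    # your drawing is black ink on white paper.
    arr = 255.0 - arr
    arr /= 255.0                          # same [0, 1] scaling as training
    return arr.reshape(1, 28, 28, 1)      # batch and channel axes for predict()

# batch = preprocess_digit('my_digit.png')  # then model.predict(batch)
```

A mismatch in any one of these steps (polarity, scale, resolution) is enough to push predictions toward a few fixed classes, as described in the question.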

1 Answer


There are several things that can contribute to what you are seeing here. First, the optimization process: how you optimize your model has a direct effect on how it performs. The choice of optimizer, learning rate, learning-rate decay schedule and proper regularization are just a few of the factors involved. Beyond that, your network is very shallow and poorly designed: it does not have enough convolutional layers to exploit the image structure and build the abstractions needed for this task, and it is not deep enough either.

MNIST by itself is a very easy task; with a linear classifier you can achieve around the very accuracy you achieved, maybe even better. This shows you are not exploiting the capabilities of CNNs or deep architectures in any meaningful way. Even a simple network with one or two fully connected layers should give you better accuracy if properly trained.
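To make the linear-baseline comparison concrete, here is a sketch of such a classifier in Keras: multinomial logistic regression is just one Dense softmax layer on the flattened pixels (the variable names are illustrative; training it for a few epochs on MNIST typically lands in the low 90s):

```python
from tensorflow.keras.layers import Dense, Flatten
from tensorflow.keras.models import Sequential

linear = Sequential([
    Flatten(input_shape=(28, 28)),    # 784 raw pixel inputs
    Dense(10, activation='softmax'),  # one weight matrix, no hidden layers
])
linear.compile(optimizer='adam',
               loss='sparse_categorical_crossentropy',
               metrics=['accuracy'])
# linear.fit(x_train, y_train, epochs=5)  # with pixels scaled to [0, 1]
```

If a model this small is competitive with your CNN, the CNN's extra capacity is not being used.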

Try making your network deeper: use more convolutional layers, each followed by BatchNormalization and then ReLU, and avoid downsampling the input feature maps too quickly. When you downsample, you lose information; to make up for that, you usually increase the number of filters in the next layer to compensate for the reduced representational capacity. In other words, decrease the feature maps' spatial dimensions gradually and, likewise, increase the filter counts.

A huge number of filters at the beginning is wasteful for your specific use case; 32 or 64 can be more than enough. As the network gets deeper, more abstract features are built on top of the more primitive ones found in the early layers, so it is usually more reasonable to have more filters in the later layers.

Early layers are responsible for learning primitive filters, and beyond some point more filters won't improve performance; they just duplicate work already done by existing filters.

The reason you see a difference in accuracy between runs is simply that you ended up in a different local minimum! With the exact same configuration, if you train 100 times you will get 100 different results, some better than others and some worse, never exactly the same value, unless you enforce deterministic behavior by fixing a specific seed and running in CPU-only mode.
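If you do want run-to-run comparability, a minimal sketch of pinning the usual sources of randomness looks like this (full determinism additionally requires deterministic ops, and historically CPU-only execution):

```python
import random

import numpy as np
import tensorflow as tf

SEED = 42  # arbitrary choice; any fixed value works
random.seed(SEED)       # Python-level randomness (e.g. shuffling)
np.random.seed(SEED)    # NumPy randomness (e.g. data pipelines)
tf.random.set_seed(SEED)  # TensorFlow weight init and dropout
```

Call this once at the top of the script, before building the model, so weight initialization is seeded too.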