
I'm building a neural net using TensorFlow and Python, and using the Kaggle 'First Steps with Julia' dataset to train and test it. The training images are basically a set of images of different numbers and letters picked out of Google Street View, from street signs, shop names, etc. The network has two fully-connected hidden layers.

The problem I have is that the network will very quickly train itself to only give back one answer: the most common training letter (in my case 'A'). The output is in the form of a (62, 1) vector of probabilities, one for each number and letter (upper- and lower-case). This vector is EXACTLY the same for all input images.

I then tried removing all of the 'A's from my input data, at which point the network changed to only give back the next most common input type (an 'E').

So, is there some way to stop my network getting stuck in a local minimum (not sure if that's the actual term)? Is this a general problem for neural networks, or is it just that my network is broken somehow?

I'm happy to provide code if it would help.

EDIT: These are the hyperparameters of my network:

Input size : 400 (20x20 greyscale images)
Hidden layer 1 size : 100
Hidden layer 2 size : 100
Output layer size : 62 (Alphanumeric, lower- and upper-case)

Training data size : 4283 images
Validation data size : 1000 images
Test data size : 1000 images

Batch size : 100
Learning rate : 0.5
Dropout rate : 0.5
L2 regularisation parameter : 0
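
For reference, the network is shaped roughly like this (a simplified TF 1.x-style sketch rather than my exact code):

  import tensorflow as tf

  x = tf.placeholder(tf.float32, [None, 400])   # 20x20 greyscale images, flattened
  y = tf.placeholder(tf.float32, [None, 62])    # one-hot labels
  keep_prob = tf.placeholder(tf.float32)        # 0.5 during training, 1.0 at test time

  def dense(inputs, n_in, n_out):
      # Fully-connected layer with small random initial weights
      W = tf.Variable(tf.truncated_normal([n_in, n_out], stddev=0.1))
      b = tf.Variable(tf.zeros([n_out]))
      return tf.matmul(inputs, W) + b

  h1 = tf.nn.dropout(tf.nn.relu(dense(x, 400, 100)), keep_prob)
  h2 = tf.nn.dropout(tf.nn.relu(dense(h1, 100, 100)), keep_prob)
  logits = dense(h2, 100, 62)

  loss = tf.reduce_mean(
      tf.nn.softmax_cross_entropy_with_logits(labels=y, logits=logits))
  train_step = tf.train.GradientDescentOptimizer(0.5).minimize(loss)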

Maybe your learning rate is too high? Could you post the hyperparameters of your model (architecture, learning rate, batch size...)? – Olivier Moindrot
I've edited my question to include the hyperparameters. – LomaxOnTheRun

4 Answers

2 votes

Trying to squeeze blood from a stone!

I'm skeptical that with 4283 training examples your net will learn 62 categories; that's a big ask for such a small amount of data, especially since your net is not a conv net and it's forced to reduce its dimensionality to 100 at the first layer. You might as well run PCA on it and save the time.

Try this:
Step 1: Download an MNIST example and learn how to train and run it.

Step 2: Use the same MNIST network design and throw your data at it, and see how it goes. You may need to pad your images (see the sketch just below). Train it, then run it on your test data.
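
For the padding, something like this would take your 20x20 images up to MNIST's 28x28 (assuming they're stored as a NumPy array; the helper name is mine):

  import numpy as np

  def pad_to_mnist_size(images):
      # images: array of shape (num_images, 20, 20); returns (num_images, 28, 28)
      # with 4 pixels of zero padding on each side
      return np.pad(images, ((0, 0), (4, 4), (4, 4)), mode="constant")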

Now Step 3: Take your fully trained Step 1 MNIST model and "finetune" it by continuing to train it with your data only, and with a lower learning rate, for a few epochs (ultimately determine the number of epochs by validation). Then run it on your test data again and see how it does. Look up "transfer learning", and a "finetuning example" for your toolkit. (Note that for finetuning you need to modify the output layer of the net.)
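
A very rough TF 1.x-style sketch of the finetuning idea, assuming the MNIST net had a single 100-unit hidden layer (all the names here, including the checkpoint path, are assumptions rather than your actual code): restore the hidden-layer weights from the MNIST run, bolt on a fresh 62-way output layer, and keep training with a smaller learning rate.

  import tensorflow as tf

  x = tf.placeholder(tf.float32, [None, 784])   # padded 28x28 inputs, flattened
  y = tf.placeholder(tf.float32, [None, 62])

  W1 = tf.Variable(tf.truncated_normal([784, 100], stddev=0.1), name="W1")
  b1 = tf.Variable(tf.zeros([100]), name="b1")
  h1 = tf.nn.relu(tf.matmul(x, W1) + b1)

  # New output layer: the MNIST net only had 10 outputs, so this part is fresh
  W_out = tf.Variable(tf.truncated_normal([100, 62], stddev=0.1), name="W_out")
  b_out = tf.Variable(tf.zeros([62]), name="b_out")
  logits = tf.matmul(h1, W_out) + b_out

  loss = tf.reduce_mean(
      tf.nn.softmax_cross_entropy_with_logits(labels=y, logits=logits))
  train_step = tf.train.GradientDescentOptimizer(0.001).minimize(loss)  # lower rate

  # Restore only the hidden-layer parameters saved under these (assumed) names
  saver = tf.train.Saver({"W1": W1, "b1": b1})
  with tf.Session() as sess:
      sess.run(tf.global_variables_initializer())
      saver.restore(sess, "mnist_model.ckpt")   # hypothetical checkpoint path
      # ...then loop over your own data for a few epochs...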

I'm not sure how big your original source images are, but you can resize them and throw a pre-trained CIFAR-100 net at them (finetuned), or even an ImageNet model if the source images are big enough. Granted, CIFAR/ImageNet models are for colour images, but you could replicate your greyscale channel to each RGB band for fun.
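
Replicating the greyscale band is a one-liner if the images are stored as a (num_images, height, width, 1) NumPy array (the array below is just stand-in data to show the shapes):

  import numpy as np

  grey_images = np.zeros((4283, 20, 20, 1), dtype=np.float32)  # stand-in data
  rgb_images = np.repeat(grey_images, 3, axis=-1)              # -> (4283, 20, 20, 3)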

Mark my words: these steps may "seem simple", but if you can work through them and get some results by finetuning with your own data (even if they're not great results), you can consider yourself a decent NN technician.

One good tutorial for finetuning is on the Caffe website (the Flickr style example, I think); there's got to be one for TF too.

The last step is to design your own CNN. Be careful when changing filter sizes: you need to understand how they affect the outputs of each layer and how information is preserved or lost.
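
To make the filter-size point concrete, here's a quick TF 1.x-style check (illustrative numbers only) of how much a 5x5 filter with VALID padding shrinks a 20x20 input:

  import tensorflow as tf

  x = tf.placeholder(tf.float32, [None, 20, 20, 1])
  w = tf.Variable(tf.truncated_normal([5, 5, 1, 32], stddev=0.1))
  conv = tf.nn.conv2d(x, w, strides=[1, 1, 1, 1], padding="VALID")
  print(conv.get_shape())  # (?, 16, 16, 32): a 5x5 filter with VALID padding
                           # drops 4 pixels in each spatial dimension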

Another thing to do is "data augmentation" to get yourself some more data: slight rotations, resizing, lighting changes, etc. TF has some nice preprocessing ops for some of this, but some of it you'll need to do yourself.
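
For the TF side, a sketch along these lines (TF 1.x-style image ops, with arbitrary jitter amounts) gives you small brightness/contrast changes and random shifts; rotations you'd have to add yourself:

  import tensorflow as tf

  def augment(image):
      # image: a [20, 20, 1] float tensor
      image = tf.image.random_brightness(image, max_delta=0.2)
      image = tf.image.random_contrast(image, lower=0.8, upper=1.2)
      image = tf.image.resize_image_with_crop_or_pad(image, 24, 24)  # pad out
      image = tf.random_crop(image, [20, 20, 1])                     # random shift
      return image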

Good luck!

0 votes

Your learning rate is way too high. It should be around 0.01; you can experiment around that value, but 0.5 is far too high.

With a high learning rate, the network is likely to get stuck in a configuration and output something fixed, like you observed.


EDIT

It seems the real problem is the unbalanced classes in the dataset. You can try:

  • changing the loss so that examples from the less frequent classes are weighted more heavily
  • changing your sampling strategy to use balanced batches of data: when picking the examples for each batch, sample randomly from the dataset but with the same probability for each class (see the sketch below)
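
A rough sketch of the balanced-batch idea in plain NumPy (integer class labels assumed; the helper name is mine): pick classes uniformly at random, then pick one example of each chosen class.

  import numpy as np

  def balanced_batch(images, labels, batch_size):
      # labels: integer class ids of shape (num_examples,)
      by_class = {c: np.where(labels == c)[0] for c in np.unique(labels)}
      classes = np.random.choice(list(by_class.keys()), size=batch_size)
      idx = np.array([np.random.choice(by_class[c]) for c in classes])
      return images[idx], labels[idx]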
0 votes

Which optimizer are you using? If you've only tried gradient descent, try one of the adaptive ones (e.g. Adagrad, Adadelta or Adam).
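
For example (a tiny TF 1.x-style illustration with a stand-in loss, just to show where the optimizer gets swapped in):

  import tensorflow as tf

  w = tf.Variable(5.0)
  loss = tf.square(w - 2.0)  # stand-in for the network's real cross-entropy loss
  train_step = tf.train.AdamOptimizer(1e-3).minimize(loss)
  # Alternatives: tf.train.AdagradOptimizer(0.01), tf.train.AdadeltaOptimizer()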

0 votes

I'm afraid this was a rookie mistake. When turning the data from a folder of images into a single .pickle file, I used:

  imageFileNames = os.listdir(folder)

to get all the image file names in that folder. As it turns out, this returns the file names in an arbitrary order, which means I had matched up my ordered labels with effectively random images.
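
One simple fix (not necessarily the only one) is to sort the names so they line up with the ordered labels:

  import os

  folder = "path/to/training_images"           # hypothetical path
  imageFileNames = sorted(os.listdir(folder))  # deterministic order, matches sorted labels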

The network then found that the best it could do was to give every input image the same output vector, matching the most common training label, 'A'. If I took all the 'A's out of the training data, it did the same with the next most common label, 'E'.

Moral of the story: Always make sure your inputs are what you expect them to be. Just check a few out by sight to make sure they look correct.

A huge thanks to everyone who gave advice, I actually learned loads from this :-)