How to overcome overfitting in convolutional neural network when nothing helps?

Question

I'm training a convolutional neural network with siamese architecture and constrastive loss function for face verification task. And I'm faced with a huge difference in training and validation accuracy starting from literally first three or five epochs. When training accuracy reaches 95% I have ~65% validation accuracy. It is fluctuating somewhere near 70% but never reaches this number. these are training and validation accuracy plotted on one chart

So to avoid this I tried a range of standard techniques when it comes to overfitting, but before listing them here I should say that none of them really changes the picture. The gap between training and validation accuracy stays the same. So I used:

L1 regularization with lambda varying from 0.0001 to 10000.0
L2 regularization with lambda varying from 0.0001 to 10000.0
Dropout with rate from 0.2 to 0.8
Data augmentation techniques (rotation, shifting, zooming)
Removing fully connected layers except last layer.

Nothing of these really help, so I appreciate any advises from you guys. And some information about the network itself. I'm using tensorflow. This is how the model itself look like:

net = tf.layers.conv2d(
    inputs,
    kernel_size=(7, 7),
    filters=15,
    strides=1,
    activation=tf.nn.relu,
    kernel_initializer=w_init,
    kernel_regularizer=reg)
# 15 x 58 x 58
net = tf.layers.max_pooling2d(net, pool_size=(2, 2), strides=2)
# 15 x 29 x 29
net = tf.layers.conv2d(
    net,
    kernel_size=(6, 6),
    filters=45,
    strides=1,
    activation=tf.nn.relu,
    kernel_initializer=w_init,
    kernel_regularizer=reg)
# 45 x 24 x 24
net = tf.layers.max_pooling2d(net, pool_size=(4, 4), strides=4)
# 45 x 6 x 6
net = tf.layers.conv2d(
    net,
    kernel_size=(6, 6),
    filters=256,
    strides=1,
    activation=tf.nn.relu,
    kernel_initializer=w_init,
    kernel_regularizer=reg)
# 256 x 1 x 1
net = tf.reshape(net, [-1, 256])
net = tf.layers.dense(net, units=512, activation=tf.nn.relu, kernel_regularizer=reg, kernel_initializer=w_init)
net = tf.layers.dropout(net, rate=0.2)
# net = tf.layers.dense(net, units=256, activation=tf.nn.relu, kernel_regularizer=reg, kernel_initializer=w_init)
# net = tf.layers.dropout(net, rate=0.75)
return tf.layers.dense(net, units=embedding_size, activation=tf.nn.relu, kernel_initializer=w_init)

This is how loss function is implemented:

def contrastive_loss(out1, out2, labels, margin):
distance = compute_euclidian_distance_square(out1, out2)
positive_part = labels * distance
negative_part = (1 - labels) * tf.maximum(tf.square(margin) - distance, 0.0)
return tf.reduce_mean(positive_part + negative_part) / 2

This is how I get and augment data (I'm using LFW dataset):

ROTATIONS_RANGE = range(1, 25)
SHIFTS_RANGE = range(1, 18)
ZOOM_RANGE = (1.05, 1.075, 1.1, 1.125, 1.15, 1.175, 1.2, 1.225, 1.25, 1.275, 1.3, 1.325, 1.35, 1.375, 1.4)
IMG_SLICE = (slice(0, 64), slice(0, 64))


def pad_img(img):
    return np.pad(img, ((0, 2), (0, 17)), mode='constant')


def get_data(rotation=False, shifting=False, zooming=False):
    train_data = fetch_lfw_pairs(subset='train')
    test_data = fetch_lfw_pairs(subset='test')

    x1s_trn, x2s_trn, ys_trn, x1s_vld, x2s_vld = [], [], [], [], []

    for (pair, y) in zip(train_data.pairs, train_data.target):
        img1, img2 = pad_img(pair[0]), pad_img(pair[1])
        x1s_trn.append(img1)
        x2s_trn.append(img2)
        ys_trn.append(y)

        if rotation:
            for angle in ROTATIONS_RANGE:
                x1s_trn.append(np.asarray(rotate(img1, angle))[IMG_SLICE])
                x2s_trn.append(np.asarray(rotate(img2, angle))[IMG_SLICE])
                ys_trn.append(y)
                x1s_trn.append(np.asarray(rotate(img1, -angle))[IMG_SLICE])
                x2s_trn.append(np.asarray(rotate(img2, -angle))[IMG_SLICE])
                ys_trn.append(y)

        if shifting:
            for pixels_to_shift in SHIFTS_RANGE:
                x1s_trn.append(shift(img1, pixels_to_shift))
                x2s_trn.append(shift(img2, pixels_to_shift))
                ys_trn.append(y)
                x1s_trn.append(shift(img1, -pixels_to_shift))
                x2s_trn.append(shift(img2, -pixels_to_shift))
                ys_trn.append(y)

        if zooming:
            for zm in ZOOM_RANGE:
                x1s_trn.append(np.asarray(zoom(img1, zm))[IMG_SLICE])
                x2s_trn.append(np.asarray(zoom(img2, zm))[IMG_SLICE])
                ys_trn.append(y)

    for (img1, img2) in test_data.pairs:
        x1s_vld.append(pad_img(img1))
        x2s_vld.append(pad_img(img2))

    return (
        np.array(x1s_trn),
        np.array(x2s_trn),
        np.array(ys_trn),
        np.array(x1s_vld),
        np.array(x2s_vld),
        np.array(test_data.target)
    )

Thanks all!

You can't expect a good generalization with too few data. The very first thing to do is to gather as much data as you can. Basically, how many data do you have ? And how many labels ? — Robin
@cprakashagr no, but shouldn't dropout give the same effect? — Andrew Popow
@debzsud I have 2200 training image pairs (this is artificially augmented to ~50000 using rotation, shifting, zooming each image pair with different parameters) and 1000 validation pairs. I have only 2 labels, 1 - both images display the same person, 0 - images display different persons. — Andrew Popow
It looks like you're only putting dropout on the back end of the network - have you tried putting dropout on the convolutional layers as well? I'd try that before bothering too much with L1/L2 regularization. If this is a binary classification problem, you probably also want a sigmoid in the output (logit activation for binary classifier to yield prediction on [0,1] — T3am5hark

Abdu307 Abdu307 · Accepted Answer · 2017-07-25T07:07:08

This is common problem with small size dataset (LFW dataset size = 13,000 images).

you can try:

Transfer learning: https://github.com/Hvass-Labs/TensorFlow-Tutorials/blob/master/08_Transfer_Learning.ipynb
Use larger dataset (= 202599 images): http://mmlab.ie.cuhk.edu.hk/projects/CelebA.html

How to overcome overfitting in convolutional neural network when nothing helps?

2 Answers