I'm using the normalized MNIST dataset (784 input features). My network architecture is 784-256-256-10: two hidden layers of 256 neurons each with sigmoid activations, and a softmax activation at the 10-neuron output layer. I'm also using the cross-entropy cost function.

Weight matrix initialization:

import numpy as np

input_size = 784
hidden1_size = 256
hidden2_size = 256
output_size = 10
Theta1 = np.random.randn(hidden1_size, input_size)
b1 = np.random.randn(hidden1_size)

Theta2 = np.random.randn(hidden2_size, hidden1_size)
b2 = np.random.randn(hidden2_size)

Theta3 = np.random.randn(output_size, hidden2_size)
b3 = np.random.randn(output_size)
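
The snippets in this question call `sigmoid`, `sigmoid_prime`, and `softmax` without showing them. A minimal sketch of what they might look like, assuming the standard textbook definitions (these are not the asker's actual helpers, and `softmax_prime` is left out because its intended form is not shown):

```python
import numpy as np

def sigmoid(z):
    """Elementwise logistic function 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    """Derivative of the logistic function with respect to z."""
    s = sigmoid(z)
    return s * (1.0 - s)

def softmax(z):
    """Softmax over a 1-D vector; subtracting the max avoids overflow."""
    e = np.exp(z - np.max(z))
    return e / np.sum(e)
```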

My network works as expected here:

epochs = 2000
learning_rate = 0.01
for j in range(epochs):
    # total_train is an array of length 50000
    # Each element of total_train is a tuple of: (a) input vector of length 784
    # and (b) the corresponding one-hot encoded label of length 10
    # Similarly, total_test is an array of length 10000
    shuffle(total_train)
    train = total_train[:1000]
    shuffle(total_test)
    test = total_test[:1000]
    predictions = []
    test_predictions = []
    for i in range(len(train)):
        # Feed forward
        x, t = train[i][0], train[i][1]
        z1 = np.dot(Theta1, x) + b1
        a1 = sigmoid(z1)
        z2 = np.dot(Theta2, a1) + b2
        a2 = sigmoid(z2)
        z3 = np.dot(Theta3, a2) + b3
        y = softmax(z3)
        # Is prediction == target?
        predictions.append(np.argmax(y) == np.argmax(t))

        # Negative log probability cost function
        cost = -np.sum(t * np.log(y))

        # Backpropagation
        delta3 = (y - t) * softmax_prime(z3)
        dTheta3 = np.outer(delta3, a2)
        db3 = delta3

        delta2 = np.dot(Theta3.T, delta3) * sigmoid_prime(z2)
        dTheta2 = np.outer(delta2, a1)
        db2 = delta2

        delta1 = np.dot(Theta2.T, delta2) * sigmoid_prime(z1)
        dTheta1 = np.outer(delta1, x)
        db1 = delta1

        # Update weights
        Theta1 -= learning_rate * dTheta1
        b1 -= learning_rate * db1
        Theta2 -= learning_rate * dTheta2
        b2 -= learning_rate * db2
        Theta3 -= learning_rate * dTheta3
        b3 -= learning_rate * db3

    if j % 10 == 0:
        m = len(predictions)
        performance = sum(predictions)/m
        print('Epoch:', j, 'Train performance:', performance)

    # Test accuracy on test data
    for i in range(len(test)):
        # Feed forward
        x, t = test[i][0], test[i][1]
        z1 = np.dot(Theta1, x) + b1
        a1 = sigmoid(z1)
        z2 = np.dot(Theta2, a1) + b2
        a2 = sigmoid(z2)
        z3 = np.dot(Theta3, a2) + b3
        y = softmax(z3)
        # Is prediction == target?
        test_predictions.append(np.argmax(y) == np.argmax(t))

    m = len(test_predictions)
    performance = sum(test_predictions)/m
    print('Epoch:', j, 'Test performance:', performance)

Output (Every 10 epochs):

Epoch: 0 Train performance: 0.121
Epoch: 0 Test performance: 0.146
Epoch: 10 Train performance: 0.37
Epoch: 10 Test performance: 0.359
Epoch: 20 Train performance: 0.41
Epoch: 20 Test performance: 0.433
Epoch: 30 Train performance: 0.534
Epoch: 30 Test performance: 0.52
Epoch: 40 Train performance: 0.607
Epoch: 40 Test performance: 0.601
Epoch: 50 Train performance: 0.651
Epoch: 50 Test performance: 0.669
Epoch: 60 Train performance: 0.71
Epoch: 60 Test performance: 0.711
Epoch: 70 Train performance: 0.719
Epoch: 70 Test performance: 0.694
Epoch: 80 Train performance: 0.75
Epoch: 80 Test performance: 0.752
Epoch: 90 Train performance: 0.76
Epoch: 90 Test performance: 0.758
Epoch: 100 Train performance: 0.766
Epoch: 100 Test performance: 0.769

But when I introduce dropout regularization, my network breaks. My code changes for dropout are:

dropout_prob = 0.5

# Feed forward
x, t = train[i][0], train[i][1]
z1 = np.dot(Theta1, x) + b1
a1 = sigmoid(z1)
mask1 = np.random.random(len(z1))
mask1 = mask1 < dropout_prob
a1 *= mask1
z2 = np.dot(Theta2, a1) + b2
a2 = sigmoid(z2)
mask2 = np.random.random(len(z2))
mask2 = mask2 < dropout_prob
a2 *= mask2
z3 = np.dot(Theta3, a2) + b3
y = softmax(z3)

# Backpropagation
delta3 = (y - t) * softmax_prime(z3)
dTheta3 = np.outer(delta3, a2)
db3 = delta3 * 1

delta2 = np.dot(Theta3.T, delta3) * sigmoid_prime(z2)
dTheta2 = np.outer(delta2, a1)
db2 = delta2 * 1

delta1 = np.dot(Theta2.T, delta2) * sigmoid_prime(z1)
dTheta1 = np.outer(delta1, x)
db1 = delta1 * 1

The performance stays at around 0.1 (10%).

Any pointers on where I'm going wrong are much appreciated.

50% dropout is way too much for such a small network. - Thomas Jungblut
@ThomasJungblut Tried with dropout_prob = 0.9 and a smaller learning rate of 0.005, but not much improvement. Now the accuracy is around 20%, but almost constant. I tried plotting the costs vs epochs, and this is what I see. (When I set dropout_prob = 1, it works the same as without dropout, and accuracy rises as expected.) - ishwr_
What do you mean by 'my network breaks'? One thought: did you change the gradient descent update to include the dropout_prob? - jonnybazookatone
Did you turn off dropout at test time? - Imran
@jonnybazookatone Your comment made me go through Hinton's paper again, where it says "dropout net should typically use 10-100 times the learning rate that was optimal for a standard neural net". 0.03 seemed an ideal learning rate for my net without dropout, and my net trained up to 15% error rate. I increased the learning rate when I used dropout, but the lowest error was still ~80%. Also note that I've tried training it with 5 times more epochs than without dropout. - ishwr_

1 Answer

There is a major issue in your implementation of dropout: you're not scaling the activations at test time. Here's the relevant quote from the great CS231n tutorial:

Crucially, note that in the predict function we are not dropping anymore, but we are performing a scaling of both hidden layer outputs by p.

This is important because at test time all neurons see all their inputs, so we want the outputs of neurons at test time to be identical to their expected outputs at training time. For example, in case of p=0.5, the neurons must halve their outputs at test time to have the same output as they had during training time (in expectation).

To see this, consider an output of a neuron x (before dropout). With dropout, the expected output from this neuron will become px + (1 − p)·0, because the neuron’s output will be set to zero with probability 1 − p. At test time, when we keep the neuron always active, we must adjust x → px to keep the same expected output.

It can also be shown that performing this attenuation at test time can be related to the process of iterating over all the possible binary masks (and therefore all the exponentially many sub-networks) and computing their ensemble prediction.
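
The two claims in the quote can be checked numerically. The following is only a sketch under stated assumptions: the activations `x` and weights `w` are made-up values, `p` is the keep probability (the role `dropout_prob` plays in the question's code), and the exact ensemble identity shown here holds for a single linear unit with p = 0.5, where all masks are equally likely.

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)
p = 0.5                                # keep probability
x = np.array([0.2, 0.9, 0.5, 0.7])     # hypothetical activations before dropout
w = np.array([1.0, -2.0, 0.5, 3.0])    # hypothetical weights of one linear unit

# Monte Carlo estimate of the expected activation under train-time dropout.
masks = rng.random((200_000, x.size)) < p
train_mean = (masks * x).mean(axis=0)

# Test-time activation: every unit kept, output scaled by p.
test_scaled = p * x

# Exhaustive average over all 2**4 binary masks for the linear unit.
all_masks = np.array(list(itertools.product([0, 1], repeat=x.size)))
ensemble_mean = np.mean(all_masks @ (w * x))
scaled_output = w @ test_scaled

print(train_mean, test_scaled)
print(ensemble_mean, scaled_output)
```

With nonlinear layers in between, the p-scaled forward pass is only an approximation of the full ensemble average, not an exact match.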

The most common solution is to use inverted dropout, which performs the scaling at train time and leaves the forward pass at test time untouched. This is what it looks like in code:

mask1 = (mask1 < dropout_prob) / dropout_prob
...
mask2 = (mask2 < dropout_prob) / dropout_prob
...
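
To make the train/test difference concrete, here is a minimal sketch of one hidden layer with inverted dropout. The function `hidden_layer`, the name `keep_prob`, and the small random weights are illustrative assumptions, not the asker's code; `keep_prob` corresponds to `dropout_prob` in the question, where a unit is kept when the random draw falls below it.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def hidden_layer(theta, b, a_prev, keep_prob, rng=None):
    """One sigmoid layer with inverted dropout.

    rng given  -> training: drop units, scale survivors by 1 / keep_prob.
    rng absent -> testing: plain forward pass, no mask and no scaling.
    Returns the activation and the mask (None at test time), so a
    backward pass could reuse the same mask.
    """
    a = sigmoid(np.dot(theta, a_prev) + b)
    if rng is None:
        return a, None
    mask = (rng.random(a.shape) < keep_prob) / keep_prob
    return a * mask, mask

rng = np.random.default_rng(0)
theta = rng.standard_normal((256, 784)) * 0.05  # hypothetical small weights
b = np.zeros(256)
x = rng.random(784)                             # hypothetical input vector

# Test-time activation versus the average of many train-time activations.
a_test, _ = hidden_layer(theta, b, x, keep_prob=0.5)
a_train_mean = np.mean(
    [hidden_layer(theta, b, x, 0.5, rng)[0] for _ in range(2000)], axis=0
)
print(abs(a_train_mean.mean() - a_test.mean()))
```

Because the survivors are scaled up during training, the averaged train-time activations should sit close to the unscaled test-time activations, which is the property the test loop in the question relies on.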