Optimizing self coded 2-layer Artificial Neural Network

Question

I've recently started learning about Neural Networks and decided to code my own simple 2-layer ANN and benchmark it using the MNIST dataset. I have tried to program it using batch SGD where the batch size is provided by the user. My code is as follows :

class NeuralNetwork:
    def __init__(self, inodes, hnodes, outnodes, activation_func, learning_rate):
        self.inodes = inodes
        self.hnodes = hnodes
        self.onodes = outnodes
        self.activation_function = activation_func
        self.lr = learning_rate
        self.wih = np.random.randn(self.hnodes, self.inodes) / pow(self.inodes, 0.5)
        self.who = np.random.randn(self.onodes, self.hnodes) / pow(self.hnodes, 0.5)

    def train(self, training_data, target_labels, batch=1, l2_penalty=0, verbose=False):
        batch_size = len(training_data) / batch
        print "Starting to train........"
        for i in range(batch):
            train_data_batch = training_data[batch_size*i : batch_size*(i+1)]
            label_batch = target_labels[batch_size*i : batch_size*(i+1)]
            batch_error = self.train_batch(train_data_batch, label_batch, l2_penalty)
            if verbose:
                print "Batch : " + str(i+1) + " ; Error : " + str(batch_error)
        print "..........Finished!"

    def train_batch(self, training_data, target_labels, l2_penalty=0):
        train = np.array(training_data, ndmin=2).T
        label = np.array(target_labels, ndmin=2).T

        inputs = train # IxN
        hidden_input = np.dot(self.wih, inputs) # (HxI).(IxN) = HxN
        hidden_ouputs = self.activation_function(hidden_input) # (HxN) -> (HxN)

        final_input = np.dot(self.who, hidden_ouputs) # (OxH).(HxN) -> OxN
        final_outputs = self.activation_function(final_input) # OxN -> OxN

        final_outputs = np.exp(final_outputs) # OxN
        for f in range(len(final_outputs)):
            final_outputs[f] = final_outputs[f] / sum(final_outputs[f])

        final_error_wrt_out = label - final_outputs # OxN
        hidden_error_wrt_out = np.dot(self.who.T, final_outputs) # HxN

        final_in_wrt_out = self.activation_function(final_input, der=True) # OxN
        hidden_in_wrt_out = self.activation_function(hidden_input, der=True) # HxN

        grad_who = np.dot(final_error_wrt_out * final_in_wrt_out, hidden_ouputs.T) # (OxN).(NxH) -> OxH
        grad_wih = np.dot(hidden_error_wrt_out * hidden_in_wrt_out, inputs.T) # (HxN).(NxI) -> HxI

        self.who = self.who - self.lr * (grad_who + l2_penalty*(self.who))
        self.wih = self.wih - self.lr * (grad_wih + l2_penalty*(self.wih))

        return np.sum(final_error_wrt_out * final_error_wrt_out) / (2*len(training_data))

    def query(self, inputs):
        if len(inputs) != self.inodes:
            print "Invalid input size"
            return
        inputs = np.array(inputs)
        hidden_input = np.dot(self.wih, inputs)
        hidden_ouputs = self.activation_function(hidden_input)

        final_input = np.dot(self.who, hidden_ouputs)
        final_outputs = self.activation_function(final_input)

        final_outputs = np.exp(final_outputs)
        total = sum(final_outputs)
        probs = final_outputs / total

        return probs

I found a similar code by Tariq Rashid on github which gives about 95% accuracy. My code on the other hand is giving only 10%.

I have tried debugging the code multiple times referring to various tutorials on Backpropogation but have not been able to improve my accuracy. I'd appreciate any insight into the issue.

Edit 1: This is following the answer by mattdeak.

I had previously used MSE instead of Negative Log Likelihood error for the softmax layer, an error on my part. Following the answer I have changed the train function as follows:

def train_batch(self, training_data, target_labels, l2_penalty=0):
    train = np.array(training_data, ndmin=2).T
    label = np.array(target_labels, ndmin=2).T

    inputs = train # IxN
    hidden_input = np.dot(self.wih, inputs) # (HxI).(IxN) = HxN
    hidden_ouputs = self.activation_function(hidden_input) # (HxN) -> (HxN)

    final_input = np.dot(self.who, hidden_ouputs) # (OxH).(HxN) -> OxN
    final_outputs = self.activation_function(final_input) # OxN -> OxN

    final_outputs = np.exp(final_outputs) # OxN
    for f in range(len(final_outputs)):
        final_outputs[f] = final_outputs[f] / sum(final_outputs[f])

    error = label - final_outputs

    final_error_wrt_out = final_outputs - 1 # OxN
    hidden_error_wrt_out = np.dot(self.who.T, -np.log(final_outputs)) # (HxO).(OxN) -> HxN

    final_in_wrt_out = self.activation_function(final_input, der=True) # OxN
    hidden_in_wrt_out = self.activation_function(hidden_input, der=True) # HxN

    grad_who = np.dot(final_error_wrt_out * final_in_wrt_out, hidden_ouputs.T) # (OxN).(NxH) -> OxH
    grad_wih = np.dot(hidden_error_wrt_out * hidden_in_wrt_out, inputs.T) # (HxN).(NxI) -> HxI

    self.who = self.who - self.lr * (grad_who + l2_penalty*(self.who))
    self.wih = self.wih - self.lr * (grad_wih + l2_penalty*(self.wih))

    return np.sum(final_error_wrt_out * final_error_wrt_out) / (2*len(training_data))

However this has not led to any performance gain.

@mattdeak it is indeed softmax regression and I am ultimately doing np.exp(final_outputs)/np.sum(np.exp(final_outputs)). This result is stored in the 'probs' variable using the 'for' loop immediately after final_outputs = np.exp(final_outputs). I found it easier to do this operation over multiple lines as it helped me debug the program better. — Chaitanya

mattdeak mattdeak · Accepted Answer · 2017-02-02T13:05:34

I don't think you are backpropagating through the softmax layer in your train step. If I'm not mistaken, I believe the gradient of the softmax can be computed as simply as:

grad_softmax = final_outputs - 1

Optimizing self coded 2-layer Artificial Neural Network

1 Answers