What is the difference between MLP implementation from scratch and in PyTorch?

Question

Following up the question from How to update the learning rate in a two layered multi-layered perceptron?

Given the XOR problem:

X = xor_input = np.array([[0,0], [0,1], [1,0], [1,1]])
Y = xor_output = np.array([[0,1,1,0]]).T

And a simple

two layered Multi-Layered Perceptron (MLP) with
sigmoid activations between them and
Mean Square Error (MSE) as the loss function/optimization criterion

If we train the model from scratch as such:

from itertools import chain
import matplotlib.pyplot as plt
import numpy as np
np.random.seed(0)

def sigmoid(x): # Returns values that sums to one.
    return 1 / (1 + np.exp(-x))

def sigmoid_derivative(sx):
    # See https://math.stackexchange.com/a/1225116
    return sx * (1 - sx)

# Cost functions.
def mse(predicted, truth):
    return 0.5 * np.mean(np.square(predicted - truth))

def mse_derivative(predicted, truth):
    return predicted - truth

X = xor_input = np.array([[0,0], [0,1], [1,0], [1,1]])
Y = xor_output = np.array([[0,1,1,0]]).T

# Define the shape of the weight vector.
num_data, input_dim = X.shape
# Lets set the dimensions for the intermediate layer.
hidden_dim = 5
# Initialize weights between the input layers and the hidden layer.
W1 = np.random.random((input_dim, hidden_dim))

# Define the shape of the output vector. 
output_dim = len(Y.T)
# Initialize weights between the hidden layers and the output layer.
W2 = np.random.random((hidden_dim, output_dim))

# Initialize weigh
num_epochs = 5000
learning_rate = 0.3

losses = []

for epoch_n in range(num_epochs):
    layer0 = X
    # Forward propagation.

    # Inside the perceptron, Step 2. 
    layer1 = sigmoid(np.dot(layer0, W1))
    layer2 = sigmoid(np.dot(layer1, W2))

    # Back propagation (Y -> layer2)

    # How much did we miss in the predictions?
    cost_error = mse(layer2, Y)
    cost_delta = mse_derivative(layer2, Y)

    #print(layer2_error)
    # In what direction is the target value?
    # Were we really close? If so, don't change too much.
    layer2_error = np.dot(cost_delta, cost_error)
    layer2_delta = cost_delta *  sigmoid_derivative(layer2)

    # Back propagation (layer2 -> layer1)
    # How much did each layer1 value contribute to the layer2 error (according to the weights)?
    layer1_error = np.dot(layer2_delta, W2.T)
    layer1_delta = layer1_error * sigmoid_derivative(layer1)

    # update weights
    W2 += - learning_rate * np.dot(layer1.T, layer2_delta)
    W1 += - learning_rate * np.dot(layer0.T, layer1_delta)
    #print(np.dot(layer0.T, layer1_delta))
    #print(epoch_n, list((layer2)))

    # Log the loss value as we proceed through the epochs.
    losses.append(layer2_error.mean())
    #print(cost_delta)


# Visualize the losses
plt.plot(losses)
plt.show()

We get a sharp dive in the loss from epoch 0 and then saturates quickly:

But if we train a similar model with pytorch, the training curve has a gradual drop in losses before saturating:

What is the difference between the MLP from scratch and the PyTorch code?

Why is it achieving convergence at different point?

Other than the weights initialization, np.random.rand() in the code from scratch and the default torch initialization, I can't seem to see a difference in the model.

Code for PyTorch:

from tqdm import tqdm
import numpy as np

import torch
from torch import nn
from torch import tensor
from torch import optim

import matplotlib.pyplot as plt

torch.manual_seed(0)
device = 'gpu' if torch.cuda.is_available() else 'cpu'

# XOR gate inputs and outputs.
X = xor_input = tensor([[0,0], [0,1], [1,0], [1,1]]).float().to(device)
Y = xor_output = tensor([[0],[1],[1],[0]]).float().to(device)


# Use tensor.shape to get the shape of the matrix/tensor.
num_data, input_dim = X.shape
print('Inputs Dim:', input_dim) # i.e. n=2 

num_data, output_dim = Y.shape
print('Output Dim:', output_dim) 
print('No. of Data:', num_data) # i.e. n=4

# Step 1: Initialization. 

# Initialize the model.
# Set the hidden dimension size.
hidden_dim = 5
# Use Sequential to define a simple feed-forward network.
model = nn.Sequential(
            # Use nn.Linear to get our simple perceptron.
            nn.Linear(input_dim, hidden_dim),
            # Use nn.Sigmoid to get our sigmoid non-linearity.
            nn.Sigmoid(),
            # Second layer neurons.
            nn.Linear(hidden_dim, output_dim),
            nn.Sigmoid()
        )
model

# Initialize the optimizer
learning_rate = 0.3
optimizer = optim.SGD(model.parameters(), lr=learning_rate)

# Initialize the loss function.
criterion = nn.MSELoss()

# Initialize the stopping criteria
# For simplicity, just stop training after certain no. of epochs.
num_epochs = 5000 

losses = [] # Keeps track of the loses.

# Step 2-4 of training routine.

for _e in tqdm(range(num_epochs)):
    # Reset the gradient after every epoch. 
    optimizer.zero_grad() 
    # Step 2: Foward Propagation
    predictions = model(X)

    # Step 3: Back Propagation 
    # Calculate the cost between the predictions and the truth.
    loss = criterion(predictions, Y)
    # Remember to back propagate the loss you've computed above.
    loss.backward()

    # Step 4: Optimizer take a step and update the weights.
    optimizer.step()

    # Log the loss value as we proceed through the epochs.
    losses.append(loss.data.item())


plt.plot(losses)

It might help if you could tell us how sharp the dive is in the hand-coded example. 2 epochs? 20? The obvious interpretation of the graph for me is that the learning rate is somehow very different. (Also, as a separate note: the MSE-loss is probably not the appropriate error function here, and you'll want to use a negative log loss/cross entropy loss in practice for outputs in $[0, 1]$, but for a problem this simple it shouldn't really matter, and of course it is not especially relevant to the question.) — Mees de Vries
Your from-scratch code throws ---> 60 layer1_error = np.dot(layer2_delta, W2.T) ..... ValueError: shapes (4,50) and (1,5) not aligned: 50 (dim 1) != 1 (dim 0) — cs95
@alvas At the end of training, your loss should ideally be around 0.0, right? Doesn't that imply that something is wrong with the PyTorch code? @coldspeed I'm able to reproduce the OP's results from the from-scratch code. It seems like somehow layer2_delta is ending up with shape (4, 50) when you run it (for me layer2_delta.shape is (4,1)). — tel
Okay, it took a little while, but I figured out how to get your hand-rolled code to produce results identical to those of the Pytorch code. There are 4 significant differences to account for. It's all minor tweaks, so it looks like the core of your hand-rolled code is good (except for the thing with having to double the learning rate: that's probably a math error somewhere). — tel

tel tel · Accepted Answer · 2019-01-20T12:53:56

List of differences between hand-rolled code and PyTorch code

Turns out there are a lot of differences between what your hand-rolled code and the PyTorch code are doing. Here's what I uncovered, listed roughly in order of most to least impact on the output:

Your code and the PyTorch code use two different functions to report the loss.
Your code and the PyTorch code set up the initial weights very differently. You mention this in your question, but it turns out to have a pretty significant impact on the results.
By default, the torch.nn.Linear layers add an extra bunch of "bias" weights to the model. Thus, the 1st layer of the Pytorch model effectively has 3x5 weights and the second layer has 6x1 weights. The layers in the hand-rolled code have 2x5 and 5x1 weights, respectively.
- The bias seems to help the model to learn and adapt somewhat faster. If you turn the bias off, it takes roughly twice as many training epochs for the Pytorch model to reach near 0 loss.
Curiously, it seems like the Pytorch model is using a learning rate that is effectively half of what you specify. Alternatively, it may be that there's a stray factor of 2 that's found its way into your hand-rolled math/code somewhere.

How to get identical results from the hand-rolled and Pytorch code

By carefully accounting for the above 4 factors, it is possible to achieve complete parity between the hand-rolled and the Pytorch code. With the correct tweaks and settings, the two snippets will produce identical results:

The most important tweak - make the loss reporting functions match

The critical difference is that you end up using two completely different functions to measure the loss in the two code snippets:

In the hand rolled code, you measure the loss as layer2_error.mean(). If you unpack the variable, you can see that layer2_error.mean() is a somewhat screwy and meaningless value:

layer2_error.mean()
== np.dot(cost_delta, cost_error).mean()
== np.dot(mse_derivative(layer2, Y), mse(layer2, Y)).mean()
== np.sum(.5 * (layer2 - Y) * ((layer2 - Y)**2).mean()).mean()

On the other hand, in the PyTorch code the loss is measured in terms of the traditional definition of the mse, ie as the equivalent of np.mean((layer2 - Y)**2). You can prove this to yourself by modifying your PyTorch loop like so:

def mse(x, y):
    return np.mean((x - y)**2)

torch_losses = [] # Keeps track of the loses.
torch_losses_manual = [] # for comparison

# Step 2-4 of training routine.

for _e in tqdm(range(num_epochs)):
    # Reset the gradient after every epoch. 
    optimizer.zero_grad() 
    # Step 2: Foward Propagation
    predictions = model(X)

    # Step 3: Back Propagation 
    # Calculate the cost between the predictions and the truth.
    loss = criterion(predictions, Y)
    # Remember to back propagate the loss you've computed above.
    loss.backward()

    # Step 4: Optimizer take a step and update the weights.
    optimizer.step()

    # Log the loss value as we proceed through the epochs.
    torch_losses.append(loss.data.item())
    torch_losses_manual.append(mse(predictions.detach().numpy(), Y.detach().numpy()))

plt.plot(torch_losses, lw=5, label='torch_losses')
plt.plot(torch_losses_manual, lw=2, label='torch_losses_manual')
plt.legend()

Output:

Also important - use the same initial weights

PyTorch uses it's own special routine for setting the initial weights which produces very different results from np.random.rand. I haven't been able to exactly replicate it yet, but for the next best thing we can just hijack Pytorch. Here's a function that will get the same initial weights that the Pytorch model uses:

import torch
from torch import nn
torch.manual_seed(0)

def torch_weights(nodes_in, nodes_hidden, nodes_out, bias=None):
    model = nn.Sequential(
        nn.Linear(nodes_in, nodes_hidden, bias=bias),
        nn.Sigmoid(),
        nn.Linear(nodes_hidden, nodes_out, bias=bias),
        nn.Sigmoid()
    )

    return [t.detach().numpy() for t in model.parameters()]

Finally - in Pytorch, turn off all bias weights and double the learning rate

Eventually you may want to implement the bias weights in your own code. For now, we'll just turn the bias off in the Pytorch model and compare the results of the hand-rolled model to those of the unbiased Pytorch model.

Also, in order to make the results match you need to double the Pytorch model's learning rate. This effectively scales the results along the x axis (ie doubling the rate means that it takes half as many epochs to reach some particular feature on the loss curve).

Putting it together

In order to reproduce the hand_rolled_losses data from the plot at the start of my post, all you need to do is take your hand-rolled code and replace the mse function with:

def mse(predicted, truth):
    return np.mean(np.square(predicted - truth))

the lines that initialize the weights with:

W1,W2 = [w.T for w in torch_weights(input_dim, hidden_dim, output_dim)]

and the line that tracks the losses with:

losses.append(cost_error)

and you should be good to go.

In order to reproduce the torch_losses data from the plot we'll also need to turn the bias weights off in the Pytorch model. To do that, you just need to change the lines defining the Pytorch model like so:

model = nn.Sequential(
    # Use nn.Linear to get our simple perceptron.
    nn.Linear(input_dim, hidden_dim, bias=None),
    # Use nn.Sigmoid to get our sigmoid non-linearity.
    nn.Sigmoid(),
    # Second layer neurons.
    nn.Linear(hidden_dim, output_dim, bias=None),
    nn.Sigmoid()
)

You also need to change the line defining the learning_rate:

learning_rate = 0.3 * 2

Complete code listing

The hand-rolled code

Here's a complete listing of my version of the hand-rolled neural net code, to help with reproducing my results:

from itertools import chain
import matplotlib.pyplot as plt
import numpy as np
import scipy as sp
import scipy.stats
import torch
from torch import nn

np.random.seed(0)
torch.manual_seed(0)

def torch_weights(nodes_in, nodes_hidden, nodes_out, bias=None):
    model = nn.Sequential(
        nn.Linear(nodes_in, nodes_hidden, bias=bias),
        nn.Sigmoid(),
        nn.Linear(nodes_hidden, nodes_out, bias=bias),
        nn.Sigmoid()
    )

    return [t.detach().numpy() for t in model.parameters()]

def sigmoid(x): # Returns values that sums to one.
    return 1 / (1 + np.exp(-x))

def sigmoid_derivative(sx):
    # See https://math.stackexchange.com/a/1225116
    return sx * (1 - sx)

# Cost functions.
def mse(predicted, truth):
    return np.mean(np.square(predicted - truth))

def mse_derivative(predicted, truth):
    return predicted - truth

X = xor_input = np.array([[0,0], [0,1], [1,0], [1,1]])
Y = xor_output = np.array([[0,1,1,0]]).T

# Define the shape of the weight vector.
num_data, input_dim = X.shape
# Lets set the dimensions for the intermediate layer.
hidden_dim = 5
# Define the shape of the output vector. 
output_dim = len(Y.T)

W1,W2 = [w.T for w in torch_weights(input_dim, hidden_dim, output_dim)]

num_epochs = 5000
learning_rate = 0.3
losses = []

for epoch_n in range(num_epochs):
    layer0 = X
    # Forward propagation.

    # Inside the perceptron, Step 2. 
    layer1 = sigmoid(np.dot(layer0, W1))
    layer2 = sigmoid(np.dot(layer1, W2))

    # Back propagation (Y -> layer2)

    # In what direction is the target value?
    # Were we really close? If so, don't change too much.
    cost_delta = mse_derivative(layer2, Y)
    layer2_delta = cost_delta *  sigmoid_derivative(layer2)

    # Back propagation (layer2 -> layer1)
    # How much did each layer1 value contribute to the layer2 error (according to the weights)?
    layer1_error = np.dot(layer2_delta, W2.T)
    layer1_delta = layer1_error * sigmoid_derivative(layer1)

    # update weights
    W2 += - learning_rate * np.dot(layer1.T, layer2_delta)
    W1 += - learning_rate * np.dot(layer0.T, layer1_delta)

    # Log the loss value as we proceed through the epochs.
    losses.append(mse(layer2, Y))

# Visualize the losses
plt.plot(losses)
plt.show()

The Pytorch code

import matplotlib.pyplot as plt
from tqdm import tqdm
import numpy as np

import torch
from torch import nn
from torch import tensor
from torch import optim

torch.manual_seed(0)
device = 'gpu' if torch.cuda.is_available() else 'cpu'

num_epochs = 5000
learning_rate = 0.3 * 2

# XOR gate inputs and outputs.
X = tensor([[0,0], [0,1], [1,0], [1,1]]).float().to(device)
Y = tensor([[0],[1],[1],[0]]).float().to(device)

# Use tensor.shape to get the shape of the matrix/tensor.
num_data, input_dim = X.shape
num_data, output_dim = Y.shape

# Step 1: Initialization. 

# Initialize the model.
# Set the hidden dimension size.
hidden_dim = 5
# Use Sequential to define a simple feed-forward network.
model = nn.Sequential(
    # Use nn.Linear to get our simple perceptron.
    nn.Linear(input_dim, hidden_dim, bias=None),
    # Use nn.Sigmoid to get our sigmoid non-linearity.
    nn.Sigmoid(),
    # Second layer neurons.
    nn.Linear(hidden_dim, output_dim, bias=None),
    nn.Sigmoid()
)

# Initialize the optimizer
optimizer = optim.SGD(model.parameters(), lr=learning_rate)

# Initialize the loss function.
criterion = nn.MSELoss()

def mse(x, y):
    return np.mean((x - y)**2)

torch_losses = [] # Keeps track of the loses.
torch_losses_manual = [] # for comparison

# Step 2-4 of training routine.

for _e in tqdm(range(num_epochs)):
    # Reset the gradient after every epoch. 
    optimizer.zero_grad() 
    # Step 2: Foward Propagation
    predictions = model(X)

    # Step 3: Back Propagation 
    # Calculate the cost between the predictions and the truth.
    loss = criterion(predictions, Y)
    # Remember to back propagate the loss you've computed above.
    loss.backward()

    # Step 4: Optimizer take a step and update the weights.
    optimizer.step()

    # Log the loss value as we proceed through the epochs.
    torch_losses.append(loss.data.item())
    torch_losses_manual.append(mse(predictions.detach().numpy(), Y.detach().numpy()))

plt.plot(torch_losses, lw=5, c='C1', label='torch_losses')
plt.plot(torch_losses_manual, lw=2, c='C2', label='torch_losses_manual')
plt.legend()

Notes

Bias weights

You can find some very instructive examples that show what the bias weights are and how to implement them in this tutorial. They list a bunch of pure-Python implementations of neural nets very similar to your hand-rolled one, so it's likely you could adapt some of their code to make your own implementation of the bias.

Functions to produce an initial guess of the weights

Here's a function I adapted from that same tutorial that can produce reasonable initial values for the weights. I think the algorithm Pytorch uses internally is somewhat different, but this produces similar results:

import scipy as sp
import scipy.stats

def tnorm_weights(nodes_in, nodes_out, bias_node=0):
    # see https://www.python-course.eu/neural_network_mnist.php
    wshape = (nodes_out, nodes_in + bias_node)
    bound = 1 / np.sqrt(nodes_in)
    X = sp.stats.truncnorm(-bound, bound)
    return X.rvs(np.prod(wshape)).reshape(wshape)