Following up on the question "How to update the learning rate in a two layered multi-layered perceptron?"
Given the XOR problem:
X = xor_input = np.array([[0,0], [0,1], [1,0], [1,1]])
Y = xor_output = np.array([[0,1,1,0]]).T
And a simple
- two-layered Multi-Layered Perceptron (MLP) with
- sigmoid activations between them and
- Mean Squared Error (MSE) as the loss function/optimization criterion
If we train the model from scratch as such:
from itertools import chain
import matplotlib.pyplot as plt
import numpy as np
def sigmoid(x):  # Squashes each value into the range (0, 1).
    return 1 / (1 + np.exp(-x))

def sigmoid_derivative(sx):
    # sx is the already-activated value sigmoid(x), not the raw input x.
    return sx * (1 - sx)

# Cost functions.
def mse(predicted, truth):
    return 0.5 * np.mean(np.square(predicted - truth))

def mse_derivative(predicted, truth):
    return predicted - truth
X = xor_input = np.array([[0,0], [0,1], [1,0], [1,1]])
Y = xor_output = np.array([[0,1,1,0]]).T
# Define the shape of the weight vector.
num_data, input_dim = X.shape
# Let's set the dimensions for the intermediate layer.
hidden_dim = 5
# Initialize weights between the input layer and the hidden layer.
W1 = np.random.random((input_dim, hidden_dim))
# Define the shape of the output vector.
output_dim = len(Y.T)
# Initialize weights between the hidden layer and the output layer.
W2 = np.random.random((hidden_dim, output_dim))
# Training hyperparameters.
num_epochs = 5000
learning_rate = 0.3
losses = []
for epoch_n in range(num_epochs):
    layer0 = X
    # Forward propagation.
    # Inside the perceptron, Step 2.
    layer1 = sigmoid(np.dot(layer0, W1))
    layer2 = sigmoid(np.dot(layer1, W2))
    # Back propagation (Y -> layer2)
    # How much did we miss in the predictions?
    cost_error = mse(layer2, Y)
    cost_delta = mse_derivative(layer2, Y)
    # In what direction is the target value?
    # Were we really close? If so, don't change too much.
    layer2_error = np.dot(cost_delta, cost_error)
    layer2_delta = cost_delta * sigmoid_derivative(layer2)
    # Back propagation (layer2 -> layer1)
    # How much did each layer1 value contribute to the layer2 error (according to the weights)?
    layer1_error = np.dot(layer2_delta, W2.T)
    layer1_delta = layer1_error * sigmoid_derivative(layer1)
    # Update weights.
    W2 += - learning_rate * np.dot(layer1.T, layer2_delta)
    W1 += - learning_rate * np.dot(layer0.T, layer1_delta)
    #print(np.dot(layer0.T, layer1_delta))
    #print(epoch_n, list((layer2)))
    # Log the loss value as we proceed through the epochs.
    losses.append(layer2_error.mean())
# Visualize the losses
plt.plot(losses)
plt.show()
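As a sanity check on the cost functions above, one thing that may be worth running is a finite-difference comparison between `mse_derivative` and the numerical gradient of `mse`. This is only a sketch: `numerical_grad` is a hypothetical helper, not part of the original code, and the `predicted` values are made up for illustration. Because `mse` averages over the N predictions and carries a factor of 0.5, its exact gradient is (predicted - truth) / N, so the hand-written derivative should come out N times larger than the numerical one.

```python
import numpy as np

def mse(predicted, truth):
    return 0.5 * np.mean(np.square(predicted - truth))

def mse_derivative(predicted, truth):
    return predicted - truth

def numerical_grad(f, x, eps=1e-6):
    # Hypothetical helper: central finite differences, one element at a time.
    grad = np.zeros_like(x)
    for idx in np.ndindex(*x.shape):
        step = np.zeros_like(x)
        step[idx] = eps
        grad[idx] = (f(x + step) - f(x - step)) / (2 * eps)
    return grad

truth = np.array([[0.0], [1.0], [1.0], [0.0]])
# Made-up predictions, only used to probe the gradient.
predicted = np.array([[0.2], [0.7], [0.6], [0.4]])

analytic = mse_derivative(predicted, truth)
numeric = numerical_grad(lambda p: mse(p, truth), predicted)

# Ratio between the hand-written derivative and the numerical gradient of mse.
ratio = analytic / numeric
print(ratio.ravel())
```

If the ratio is constant and equal to the number of predictions, the update in the from-scratch loop is effectively using a larger step than the plain gradient of `mse` would give.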
We get a sharp dive in the loss from epoch 0, after which it saturates quickly.
But if we train a similar model with PyTorch, the training curve shows a gradual drop in the loss before saturating.
What is the difference between the MLP from scratch and the PyTorch code?
Why does it converge at a different point?
Other than the weight initialization (np.random.random() in the code from scratch versus the default torch initialization), I can't seem to see a difference in the model.
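To make the initialization difference concrete, here is a rough NumPy-only sketch. It assumes, from my reading of the PyTorch documentation, that `nn.Linear` includes a bias term by default and draws both weights and bias from a uniform range bounded by 1 / sqrt(in_features), whereas the from-scratch code draws weights from [0, 1) and has no bias. The names `scratch_W1`, `torch_like_W1` and `torch_like_b1` are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)
input_dim, hidden_dim = 2, 5

# From-scratch style: every weight is non-negative, drawn from [0, 1); no bias.
scratch_W1 = rng.random((input_dim, hidden_dim))

# Assumed PyTorch-style default for nn.Linear: a symmetric uniform range
# bounded by 1 / sqrt(in_features), used for both the weights and the bias.
bound = 1.0 / np.sqrt(input_dim)
torch_like_W1 = rng.uniform(-bound, bound, size=(input_dim, hidden_dim))
torch_like_b1 = rng.uniform(-bound, bound, size=hidden_dim)

print('scratch    W1 range:', scratch_W1.min(), scratch_W1.max())
print('torch-like W1 range:', torch_like_W1.min(), torch_like_W1.max())
print('torch-like b1 shape:', torch_like_b1.shape)
```

So besides the range of the initial weights, the extra bias parameters would also count as a difference between the two models, if that assumption holds.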
Code for PyTorch:
from tqdm import tqdm
import numpy as np
import torch
from torch import nn
from torch import tensor
from torch import optim
import matplotlib.pyplot as plt
device = 'cuda' if torch.cuda.is_available() else 'cpu'
# XOR gate inputs and outputs.
X = xor_input = tensor([[0,0], [0,1], [1,0], [1,1]]).float().to(device)
Y = xor_output = tensor([[0],[1],[1],[0]]).float().to(device)
# Use tensor.shape to get the shape of the matrix/tensor.
num_data, input_dim = X.shape
print('Inputs Dim:', input_dim) # i.e. n=2
num_data, output_dim = Y.shape
print('Output Dim:', output_dim)
print('No. of Data:', num_data) # i.e. n=4
# Step 1: Initialization.
# Initialize the model.
# Set the hidden dimension size.
hidden_dim = 5
# Use Sequential to define a simple feed-forward network.
model = nn.Sequential(
    # Use nn.Linear to get our simple perceptron.
    nn.Linear(input_dim, hidden_dim),
    # Use nn.Sigmoid to get our sigmoid non-linearity.
    nn.Sigmoid(),
    # Second layer neurons.
    nn.Linear(hidden_dim, output_dim),
    nn.Sigmoid()
).to(device)
# Initialize the optimizer
learning_rate = 0.3
optimizer = optim.SGD(model.parameters(), lr=learning_rate)
# Initialize the loss function.
criterion = nn.MSELoss()
# Initialize the stopping criteria
# For simplicity, just stop training after certain no. of epochs.
num_epochs = 5000
losses = []  # Keeps track of the losses.
# Step 2-4 of training routine.
for _e in tqdm(range(num_epochs)):
    # Reset the gradient after every epoch.
    optimizer.zero_grad()
    # Step 2: Forward Propagation
    predictions = model(X)
    # Step 3: Back Propagation
    # Calculate the cost between the predictions and the truth.
    loss = criterion(predictions, Y)
    # Remember to back propagate the loss you've computed above.
    loss.backward()
    # Step 4: Optimizer takes a step and updates the weights.
    optimizer.step()
    # Log the loss value as we proceed through the epochs.
    losses.append(loss.item())
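To test whether the gap comes from the bias terms, the initialization and the loss scaling, one could strip the PyTorch model down so that it mirrors the from-scratch setup. The following is a sketch under those assumptions, not a verified explanation: `bias=False`, `nn.init.uniform_` and `reduction='sum'` are choices made for this comparison, and `scratch_like_loss` is a hypothetical helper.

```python
import torch
from torch import nn, optim

torch.manual_seed(0)

X = torch.tensor([[0, 0], [0, 1], [1, 0], [1, 1]]).float()
Y = torch.tensor([[0], [1], [1], [0]]).float()
input_dim, hidden_dim, output_dim = 2, 5, 1

# No bias terms, to match W1 and W2 in the from-scratch version.
model = nn.Sequential(
    nn.Linear(input_dim, hidden_dim, bias=False),
    nn.Sigmoid(),
    nn.Linear(hidden_dim, output_dim, bias=False),
    nn.Sigmoid(),
)

# Draw the weights from [0, 1), like np.random.random.
for layer in model:
    if isinstance(layer, nn.Linear):
        nn.init.uniform_(layer.weight, 0.0, 1.0)

sum_squared_error = nn.MSELoss(reduction='sum')

def scratch_like_loss(predictions, truth):
    # Hypothetical helper: 0.5 * sum of squared errors. Its gradient with
    # respect to the predictions is (predictions - truth), which is what
    # mse_derivative returns in the from-scratch code.
    return 0.5 * sum_squared_error(predictions, truth)

optimizer = optim.SGD(model.parameters(), lr=0.3)
losses = []
for _ in range(5000):
    optimizer.zero_grad()
    loss = scratch_like_loss(model(X), Y)
    loss.backward()
    optimizer.step()
    losses.append(loss.item())
```

Plotting `losses` from this variant next to the two original curves should show which of the three changes, if any, accounts for the different shapes.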
---> 60 layer1_error = np.dot(layer2_delta, W2.T) ..... ValueError: shapes (4,50) and (1,5) not aligned: 50 (dim 1) != 1 (dim 0)
– cs950.0
, right? Doesn't that imply that something is wrong with the PyTorch code? @coldspeed I'm able to reproduce the OP's results from the from-scratch code. It seems like somehow layer2_delta is ending up with shape (4, 50) when you run it (for me layer2_delta.shape). – tel
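Regarding the shape mismatch reported in the comments, a quick way to see what each intermediate array should look like is to run a single forward and backward pass and print the shapes. This sketch assumes `hidden_dim = 5` as in the question and uses the `np.dot` calls spelled out in full; the `shapes` dictionary is only there for illustration.

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def sigmoid_derivative(sx):
    return sx * (1 - sx)

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
Y = np.array([[0, 1, 1, 0]]).T

hidden_dim = 5
W1 = np.random.random((X.shape[1], hidden_dim))
W2 = np.random.random((hidden_dim, Y.shape[1]))

# One forward pass.
layer1 = sigmoid(np.dot(X, W1))
layer2 = sigmoid(np.dot(layer1, W2))

# One backward pass.
layer2_delta = (layer2 - Y) * sigmoid_derivative(layer2)
layer1_error = np.dot(layer2_delta, W2.T)
layer1_delta = layer1_error * sigmoid_derivative(layer1)

shapes = {
    'layer1': layer1.shape,
    'layer2': layer2.shape,
    'layer2_delta': layer2_delta.shape,
    'W2.T': W2.T.shape,
    'layer1_error': layer1_error.shape,
    'layer1_delta': layer1_delta.shape,
}
for name, shape in shapes.items():
    print(name, shape)
```

With these dimensions `layer2_delta` is (4, 1) and `W2.T` is (1, 5), so the dot product is aligned; a (4, 50) array at that point would suggest the code being run differs from the one posted.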