3
votes

Using a PyTorch nn.Sequential model, I'm unable to learn all four representations of the XOR booleans:

import numpy as np

import torch
from torch import nn
from torch.autograd import Variable
from torch import FloatTensor
from torch import optim

use_cuda = torch.cuda.is_available()

X = xor_input = np.array([[0,0], [0,1], [1,0], [1,1]])
Y = xor_output = np.array([[0,1,1,0]]).T

# Converting the X to PyTorch-able data structure.
X_pt = Variable(FloatTensor(X))
X_pt = X_pt.cuda() if use_cuda else X_pt
# Converting the Y to PyTorch-able data structure.
Y_pt = Variable(FloatTensor(Y), requires_grad=False)
Y_pt = Y_pt.cuda() if use_cuda else Y_pt

input_dim = 2
hidden_dim = 5
output_dim = 1

model = nn.Sequential(nn.Linear(input_dim, hidden_dim),
                      nn.Linear(hidden_dim, output_dim),
                      nn.Sigmoid())
criterion = nn.L1Loss()
learning_rate = 0.03
optimizer = optim.SGD(model.parameters(), lr=learning_rate)
num_epochs = 10000

for _ in range(num_epochs):
    predictions = model(X_pt)
    loss_this_epoch = criterion(predictions, Y_pt)
    loss_this_epoch.backward()
    optimizer.step()
    print([int(_pred > 0.5) for _pred in predictions], list(map(int, Y_pt)), loss_this_epoch.data[0])

After learning:

for _x, _y in zip(X_pt, Y_pt):
    prediction = model(_x)
    print('Input:\t', list(map(int, _x)))
    print('Pred:\t', int(prediction))
    print('Output:\t', int(_y))
    print('######')

[out]:

Input:   [0, 0]
Pred:    0
Output:  0
######
Input:   [0, 1]
Pred:    1
Output:  1
######
Input:   [1, 0]
Pred:    0
Output:  1
######
Input:   [1, 1]
Pred:    0
Output:  0
######

I've tried running the same code over a couple of random seeds, but it never managed to learn all four XOR representations.

Without PyTorch, I could easily train a model with hand-defined derivative functions and manually performed backpropagation; see https://www.kaggle.io/svf/2342536/635025ecf1de59b71ea4fa03eb84f9f9/results.html#After-some-enlightenment
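
Roughly, a hand-rolled version looks like the sketch below (a condensed rewrite with sigmoid activations and squared-error loss; the notebook's exact code may differ):

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Sigmoid derivative, expressed in terms of the sigmoid output s.
def sigmoid_prime(s):
    return s * (1.0 - s)

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
Y = np.array([[0, 1, 1, 0]], dtype=float).T

hidden_dim, lr = 5, 1.0
W1, b1 = np.random.randn(2, hidden_dim), np.zeros(hidden_dim)  # input -> hidden
W2, b2 = np.random.randn(hidden_dim, 1), np.zeros(1)           # hidden -> output

for _ in range(10000):
    # Forward pass.
    h = sigmoid(X @ W1 + b1)
    y_hat = sigmoid(h @ W2 + b2)
    # Backward pass for the squared-error loss, chain rule written out by hand.
    d_out = (y_hat - Y) * sigmoid_prime(y_hat)
    d_hid = (d_out @ W2.T) * sigmoid_prime(h)
    # Manual gradient-descent step.
    W2 -= lr * (h.T @ d_out); b2 -= lr * d_out.sum(axis=0)
    W1 -= lr * (X.T @ d_hid); b1 -= lr * d_hid.sum(axis=0)

print(y_hat.round().ravel())  # usually [0. 1. 1. 0.] after training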

Why is it that the 2-layer MLP in PyTorch didn't learn the XOR representation?


How is this PyTorch model:

hidden_dim = 5

model = nn.Sequential(nn.Linear(input_dim, hidden_dim),
                      nn.Linear(hidden_dim, output_dim),
                      nn.Sigmoid())

different from the one hand-written with explicit derivatives, backpropagation, and optimizer step at https://www.kaggle.com/alvations/xor-with-mlp ?

Are they the same single-hidden-layer perceptron network?


Updated

Strangely, adding an nn.Sigmoid() between the nn.Linear layers didn't work:

hidden_dim = 5

model = nn.Sequential(nn.Linear(input_dim, hidden_dim),
                      nn.Sigmoid(),
                      nn.Linear(hidden_dim, output_dim),
                      nn.Sigmoid())
criterion = nn.L1Loss()
learning_rate = 0.03
optimizer = optim.SGD(model.parameters(), lr=learning_rate)
num_epochs = 10000

for _ in range(num_epochs):
    predictions = model(X_pt)
    loss_this_epoch = criterion(predictions, Y_pt)
    loss_this_epoch.backward()
    optimizer.step()

for _x, _y in zip(X_pt, Y_pt):
    prediction = model(_x)
    print('Input:\t', list(map(int, _x)))
    print('Pred:\t', int(prediction))
    print('Output:\t', int(_y))
    print('######')

[out]:

Input:   [0, 0]
Pred:    0
Output:  0
######
Input:   [0, 1]
Pred:    1
Output:  1
######
Input:   [1, 0]
Pred:    1
Output:  1
######
Input:   [1, 1]
Pred:    1
Output:  0
######

But adding nn.ReLU() did:

model = nn.Sequential(nn.Linear(input_dim, hidden_dim),
                      nn.ReLU(), 
                      nn.Linear(hidden_dim, output_dim),
                      nn.Sigmoid())

...
for _x, _y in zip(X_pt, Y_pt):
    prediction = model(_x)
    print('Input:\t', list(map(int, _x)))
    print('Pred:\t', int(prediction))
    print('Output:\t', int(_y))
    print('######')

[out]:

Input:   [0, 0]
Pred:    0
Output:  0
######
Input:   [0, 1]
Pred:    1
Output:  1
######
Input:   [1, 0]
Pred:    1
Output:  1
######
Input:   [1, 1]
Pred:    1
Output:  0
######

Isn't a sigmoid enough for the non-linear activation?

I understand that ReLU fits the task of Boolean output, but shouldn't the sigmoid function produce the same or a similar effect?
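
One numerical difference I can point at: the sigmoid's gradient never exceeds 0.25, while ReLU passes a gradient of 1 for positive inputs, so with the same learning rate the sigmoid hidden layer receives much smaller updates. A quick check:

import torch

x = torch.linspace(-5, 5, 11, requires_grad=True)

torch.sigmoid(x).sum().backward()
print(x.grad.max())   # tensor(0.2500), the sigmoid's maximum slope (at 0)

x.grad.zero_()
torch.relu(x).sum().backward()
print(x.grad.max())   # tensor(1.), ReLU's slope for positive inputs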


Updated 2

Running the same training 100 times:

from collections import Counter 
import random
random.seed(100)

import torch
from torch import nn
from torch.autograd import Variable
from torch import FloatTensor
from torch import optim
use_cuda = torch.cuda.is_available()


all_results=[]

for _ in range(100):
    hidden_dim = 2

    model = nn.Sequential(nn.Linear(input_dim, hidden_dim),
                          nn.ReLU(), # Does the sigmoid have a built-in bias?
                          nn.Linear(hidden_dim, output_dim),
                          nn.Sigmoid())

    criterion = nn.MSELoss()
    learning_rate = 0.03
    optimizer = optim.SGD(model.parameters(), lr=learning_rate)
    num_epochs = 3000

    for _ in range(num_epochs):
        predictions = model(X_pt)
        loss_this_epoch = criterion(predictions, Y_pt)
        loss_this_epoch.backward()
        optimizer.step()
        ##print([float(_pred) for _pred in predictions], list(map(int, Y_pt)), loss_this_epoch.data[0])

    x_pred = [int(model(_x)) for _x in X_pt]
    y_truth = list([int(_y[0]) for _y in Y_pt])
    all_results.append([x_pred == y_truth, x_pred, loss_this_epoch.data[0]])


tf, outputsss, losses__ = zip(*all_results)
print(Counter(tf))

It only managed to learn the XOR representation 18 out of 100 times... -_-|||
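
An exact 2-2-1 ReLU solution does exist, though; setting the weights by hand (my own construction, not taken from the runs above):

import torch
from torch import nn

# h1 = relu(x1 + x2), h2 = relu(x1 + x2 - 1), y = h1 - 2*h2 computes XOR exactly.
net = nn.Sequential(nn.Linear(2, 2), nn.ReLU(), nn.Linear(2, 1))
with torch.no_grad():
    net[0].weight.copy_(torch.tensor([[1., 1.], [1., 1.]]))
    net[0].bias.copy_(torch.tensor([0., -1.]))
    net[2].weight.copy_(torch.tensor([[1., -2.]]))
    net[2].bias.zero_()
    print(net(torch.tensor([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])))
    # tensor([[0.], [1.], [1.], [0.]])

So the failures are about the optimization not finding such a solution, not about the architecture lacking the capacity.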

Comments:

I haven't used PyTorch before, but one thing that jumps out at me is the architecture of your MLP. You're using linear activations in your hidden layers. The XOR problem can't be solved linearly, though. You could try switching your hidden layers to ReLU, Sigmoid, or one of the other non-linear activations. – Scratch'N'Purr

Also, because your problem is a classification problem, I would change the loss function to cross-entropy or negative log-likelihood. – Scratch'N'Purr

Is there a "pre-supposed" sigmoid between the nn.Linear layers? – alvas

Again, I hesitate to make a formal answer since I haven't used PyTorch, but I think you want to build the architecture as model = nn.Sequential(nn.Sigmoid(), nn.Sigmoid(), nn.Sigmoid()), with a criterion = nn.CrossEntropyLoss() or criterion = nn.NLLLoss(). Having Linear layers won't do anything to help the model, since you're just applying a weight to a linear line, which subsequently alters the weights downstream in the Sigmoid layer. – Scratch'N'Purr

I think Linear here refers to fully connected layers; the idea is to learn hidden dimensions that can separate the exclusive (1/0 or 0/1 -> 1) gate and then an OR gate through the hidden dimensions. See Figure 2 of pdfs.semanticscholar.org/51ec/… – alvas

4 Answers

5
votes

It's because nn.Linear has no activation built in, so your model is effectively a linear classifier, and XOR is the canonical example of a problem that can't be solved using linear classifiers.

Change this:

model = nn.Sequential(nn.Linear(input_dim, hidden_dim),
                      nn.Linear(hidden_dim, output_dim),
                      nn.Sigmoid())

to this:

model = nn.Sequential(nn.Linear(input_dim, hidden_dim),
                      nn.Sigmoid(),
                      nn.Linear(hidden_dim, output_dim),
                      nn.Sigmoid())

and only then will your model be equivalent to the one from the linked Kaggle notebook.
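
A minimal end-to-end run of that corrected architecture could look like the following sketch (it reuses the X_pt and Y_pt tensors from the question, zeroes the gradients each step, which the question's loop omits, and swaps in MSE with a larger learning rate; none of that is prescribed by this answer):

model = nn.Sequential(nn.Linear(2, 5),
                      nn.Sigmoid(),
                      nn.Linear(5, 1),
                      nn.Sigmoid())
criterion = nn.MSELoss()
optimizer = optim.SGD(model.parameters(), lr=1.0)

for _ in range(10000):
    optimizer.zero_grad()               # clear the gradients accumulated by backward()
    loss = criterion(model(X_pt), Y_pt)
    loss.backward()
    optimizer.step()

print([int(p > 0.5) for p in model(X_pt)])  # usually [0, 1, 1, 0]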

0
votes

You are almost there with your 2nd update. Here's a notebook with a working solution: https://colab.research.google.com/github/osipov/edu/blob/master/misc/xor.ipynb

Your mistake is using a sigmoid after the last linear layer, which makes it difficult for the optimizer to converge to the exact 0 and 1 values expected in your training dataset. Recall that the sigmoid only approaches 0 and 1 at negative and positive infinity, respectively.

So, your implementation (assuming PyTorch 1.7) should be

import torch as pt
from torch.nn.functional import mse_loss
pt.manual_seed(33);

model = pt.nn.Sequential(
    pt.nn.Linear(2, 5),
    pt.nn.ReLU(),
    pt.nn.Linear(5, 1)
)

X = pt.tensor([[0, 0],
               [0, 1],
               [1, 0],
               [1, 1]], dtype=pt.float32)

y = pt.tensor([0, 1, 1, 0], dtype=pt.float32).reshape(X.shape[0], 1)

EPOCHS = 100

optimizer = pt.optim.Adam(model.parameters(), lr = 0.03)

for epoch in range(EPOCHS):
  #forward
  y_est = model(X)
  
  #compute mean squared error loss
  loss = mse_loss(y_est, y)

  #backprop the loss gradients
  loss.backward()

  #update the model weights using the gradients
  optimizer.step()

  #empty the gradients for the next iteration
  optimizer.zero_grad()

which after execution trains the model, so that

model(X).round().abs()

returns

tensor([[0.],
        [1.],
        [1.],
        [0.]], grad_fn=<AbsBackward>)

which is the correct output.
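
If you also want probabilistic outputs, one common variation (not something this answer requires) is to keep the last layer linear and train against binary cross-entropy on the logits, which folds the sigmoid into the loss in a numerically stable way:

from torch.nn.functional import binary_cross_entropy_with_logits

# Same model, X, y, optimizer and EPOCHS as above, but with a BCE-with-logits loss.
for epoch in range(EPOCHS):
  y_est = model(X)
  loss = binary_cross_entropy_with_logits(y_est, y)
  loss.backward()
  optimizer.step()
  optimizer.zero_grad()

pt.sigmoid(model(X)).round()  # hard 0/1 predictions from the probabilities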

-1
votes

Here are a few simple changes to your code that should help put you on a better path. I've used ReLU activation functions internally, but sigmoid will also work if used correctly. Also, if you want to try the SGD optimizer, you may want to turn the learning rate down by an order of magnitude or so.

model = nn.Sequential(nn.Linear(input_dim, hidden_dim),    
                      nn.ReLU(),       
                      nn.Linear(hidden_dim, output_dim),
                      nn.Sigmoid())
if use_cuda:
  model.cuda()

criterion = nn.BCELoss()
#criterion = nn.L1Loss()
#learning_rate = 0.03
#optimizer = optim.SGD(model.parameters(), lr=learning_rate)
optimizer = optim.Adam(model.parameters())
num_epochs = 10000


for epoch in range(num_epochs):
    predictions = model(X_pt)
    loss_this_epoch = criterion(predictions, Y_pt)
    model.zero_grad()
    loss_this_epoch.backward()
    optimizer.step()
    if epoch%1000 == 0: 
      print([float(_pred) for _pred in predictions], list(map(int, Y_pt)), loss_this_epoch.data[0])
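
After training, a quick check in the same spirit as the question's evaluation loop, thresholding the sigmoid outputs at 0.5, might be:

for _x, _y in zip(X_pt, Y_pt):
    print(list(map(int, _x)), '->', int(model(_x) > 0.5), '(expected:', int(_y), ')')
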
-1
votes

With the sigmoid between the layers and at the end, the most important thing is to update the weights in a purely stochastic way, i.e., update after every single sample and pick a sample at random at every iteration.

When respecting this, and when using a large learning rate (around 1.0), I've observed that the model usually learns XOR fine with a standard 2-layer PyTorch implementation (2-2-1 layer sizes), with standard weight initialization and without regularization.
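
A sketch of that recipe as I read it (sigmoid between the layers and at the end, one randomly picked sample per update, learning rate 1.0; the remaining details are my own choices):

import random
import torch
from torch import nn, optim

X = torch.tensor([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
Y = torch.tensor([[0.], [1.], [1.], [0.]])

# 2-2-1 network with a sigmoid between the layers and at the end.
model = nn.Sequential(nn.Linear(2, 2), nn.Sigmoid(),
                      nn.Linear(2, 1), nn.Sigmoid())
criterion = nn.MSELoss()
optimizer = optim.SGD(model.parameters(), lr=1.0)

for _ in range(20000):
    i = random.randrange(4)          # pick a single sample at random
    optimizer.zero_grad()
    loss = criterion(model(X[i:i+1]), Y[i:i+1])
    loss.backward()                  # update after every single sample
    optimizer.step()

print(model(X).round())              # usually recovers [[0.], [1.], [1.], [0.]]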