3
votes

Using a PyTorch nn.Sequential model, I'm unable to learn all four representations of the XOR booleans:

import numpy as np

import torch
from torch import nn
from torch.autograd import Variable
from torch import FloatTensor
from torch import optim

use_cuda = torch.cuda.is_available()

X = xor_input = np.array([[0,0], [0,1], [1,0], [1,1]])
Y = xor_output = np.array([[0,1,1,0]]).T

# Converting the X to PyTorch-able data structure.
X_pt = Variable(FloatTensor(X))
X_pt = X_pt.cuda() if use_cuda else X_pt
# Converting the Y to PyTorch-able data structure.
Y_pt = Variable(FloatTensor(Y), requires_grad=False)
Y_pt = Y_pt.cuda() if use_cuda else Y_pt

input_dim = 2
hidden_dim = 5
output_dim = 1

model = nn.Sequential(nn.Linear(input_dim, hidden_dim),
                      nn.Linear(hidden_dim, output_dim),
                      nn.Sigmoid())
criterion = nn.L1Loss()
learning_rate = 0.03
optimizer = optim.SGD(model.parameters(), lr=learning_rate)
num_epochs = 10000

for _ in range(num_epochs):
    predictions = model(X_pt)
    loss_this_epoch = criterion(predictions, Y_pt)
    loss_this_epoch.backward()
    optimizer.step()
    print([int(_pred > 0.5) for _pred in predictions], list(map(int, Y_pt)), loss_this_epoch.data[0])

After learning:

for _x, _y in zip(X_pt, Y_pt):
    prediction = model(_x)
    print('Input:\t', list(map(int, _x)))
    print('Pred:\t', int(prediction))
    print('Output:\t', int(_y))
    print('######')

[out]:

Input:   [0, 0]
Pred:    0
Output:  0
######
Input:   [0, 1]
Pred:    1
Output:  1
######
Input:   [1, 0]
Pred:    0
Output:  1
######
Input:   [1, 1]
Pred:    0
Output:  0
######

I've tried running the same code over a couple of random seeds, but it never managed to learn all four XOR representations.

Without PyTorch, I could easily train a model with hand-defined derivative functions and manually performed backpropagation; see https://www.kaggle.io/svf/2342536/635025ecf1de59b71ea4fa03eb84f9f9/results.html#After-some-enlightenment
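
Roughly, a hand-rolled version looks like the sketch below (a condensed rewrite with sigmoid activations and squared-error loss; the notebook's exact code may differ):

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Sigmoid derivative, expressed in terms of the sigmoid output s.
def sigmoid_prime(s):
    return s * (1.0 - s)

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
Y = np.array([[0, 1, 1, 0]], dtype=float).T

hidden_dim, lr = 5, 1.0
W1, b1 = np.random.randn(2, hidden_dim), np.zeros(hidden_dim)  # input -> hidden
W2, b2 = np.random.randn(hidden_dim, 1), np.zeros(1)           # hidden -> output

for _ in range(10000):
    # Forward pass.
    h = sigmoid(X @ W1 + b1)
    y_hat = sigmoid(h @ W2 + b2)
    # Backward pass for the squared-error loss, chain rule written out by hand.
    d_out = (y_hat - Y) * sigmoid_prime(y_hat)
    d_hid = (d_out @ W2.T) * sigmoid_prime(h)
    # Manual gradient-descent step.
    W2 -= lr * (h.T @ d_out); b2 -= lr * d_out.sum(axis=0)
    W1 -= lr * (X.T @ d_hid); b1 -= lr * d_hid.sum(axis=0)

print(y_hat.round().ravel())  # usually [0. 1. 1. 0.] after training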

Why is it that the 2-layer MLP in PyTorch didn't learn the XOR representation?


How is this PyTorch model:

hidden_dim = 5

model = nn.Sequential(nn.Linear(input_dim, hidden_dim),
                      nn.Linear(hidden_dim, output_dim),
                      nn.Sigmoid())

different from the one hand-written with explicit derivatives, backpropagation, and optimizer step at https://www.kaggle.com/alvations/xor-with-mlp ?

Are they the same single-hidden-layer perceptron network?


Updated

Strangely, adding an nn.Sigmoid() between the nn.Linear layers didn't work:

hidden_dim = 5

model = nn.Sequential(nn.Linear(input_dim, hidden_dim),
                      nn.Sigmoid(),
                      nn.Linear(hidden_dim, output_dim),
                      nn.Sigmoid())
criterion = nn.L1Loss()
learning_rate = 0.03
optimizer = optim.SGD(model.parameters(), lr=learning_rate)
num_epochs = 10000

for _ in range(num_epochs):
    predictions = model(X_pt)
    loss_this_epoch = criterion(predictions, Y_pt)
    loss_this_epoch.backward()
    optimizer.step()

for _x, _y in zip(X_pt, Y_pt):
    prediction = model(_x)
    print('Input:\t', list(map(int, _x)))
    print('Pred:\t', int(prediction))
    print('Output:\t', int(_y))
    print('######')

[out]:

Input:   [0, 0]
Pred:    0
Output:  0
######
Input:   [0, 1]
Pred:    1
Output:  1
######
Input:   [1, 0]
Pred:    1
Output:  1
######
Input:   [1, 1]
Pred:    1
Output:  0
######

But adding nn.ReLU() did:

model = nn.Sequential(nn.Linear(input_dim, hidden_dim),
                      nn.ReLU(), 
                      nn.Linear(hidden_dim, output_dim),
                      nn.Sigmoid())

...
for _x, _y in zip(X_pt, Y_pt):
    prediction = model(_x)
    print('Input:\t', list(map(int, _x)))
    print('Pred:\t', int(prediction))
    print('Output:\t', int(_y))
    print('######')

[out]:

Input:   [0, 0]
Pred:    0
Output:  0
######
Input:   [0, 1]
Pred:    1
Output:  1
######
Input:   [1, 0]
Pred:    1
Output:  1
######
Input:   [1, 1]
Pred:    1
Output:  0
######

Isn't a sigmoid enough for the non-linear activation?

I understand that ReLU fits the task of Boolean output, but shouldn't the sigmoid function produce the same or a similar effect?
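
One numerical difference I can point at: the sigmoid's gradient never exceeds 0.25, while ReLU passes a gradient of 1 for positive inputs, so with the same learning rate the sigmoid hidden layer receives much smaller updates. A quick check:

import torch

x = torch.linspace(-5, 5, 11, requires_grad=True)

torch.sigmoid(x).sum().backward()
print(x.grad.max())   # tensor(0.2500), the sigmoid's maximum slope (at 0)

x.grad.zero_()
torch.relu(x).sum().backward()
print(x.grad.max())   # tensor(1.), ReLU's slope for positive inputs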


Updated 2

Running the same training 100 times:

from collections import Counter 
import random
random.seed(100)

import torch
from torch import nn
from torch.autograd import Variable
from torch import FloatTensor
from torch import optim
use_cuda = torch.cuda.is_available()


all_results=[]

for _ in range(100):
    hidden_dim = 2

    model = nn.Sequential(nn.Linear(input_dim, hidden_dim),
                          nn.ReLU(), # Does the sigmoid have a built-in bias?
                          nn.Linear(hidden_dim, output_dim),
                          nn.Sigmoid())

    criterion = nn.MSELoss()
    learning_rate = 0.03
    optimizer = optim.SGD(model.parameters(), lr=learning_rate)
    num_epochs = 3000

    for _ in range(num_epochs):
        predictions = model(X_pt)
        loss_this_epoch = criterion(predictions, Y_pt)
        loss_this_epoch.backward()
        optimizer.step()
        ##print([float(_pred) for _pred in predictions], list(map(int, Y_pt)), loss_this_epoch.data[0])

    x_pred = [int(model(_x)) for _x in X_pt]
    y_truth = list([int(_y[0]) for _y in Y_pt])
    all_results.append([x_pred == y_truth, x_pred, loss_this_epoch.data[0]])


tf, outputsss, losses__ = zip(*all_results)
print(Counter(tf))

It only managed to learn the XOR representation 18 out of 100 times... -_-|||
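
An exact 2-2-1 ReLU solution does exist, though; setting the weights by hand (my own construction, not taken from the runs above):

import torch
from torch import nn

# h1 = relu(x1 + x2), h2 = relu(x1 + x2 - 1), y = h1 - 2*h2 computes XOR exactly.
net = nn.Sequential(nn.Linear(2, 2), nn.ReLU(), nn.Linear(2, 1))
with torch.no_grad():
    net[0].weight.copy_(torch.tensor([[1., 1.], [1., 1.]]))
    net[0].bias.copy_(torch.tensor([0., -1.]))
    net[2].weight.copy_(torch.tensor([[1., -2.]]))
    net[2].bias.zero_()
    print(net(torch.tensor([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])))
    # tensor([[0.], [1.], [1.], [0.]])

So the failures are about the optimization not finding such a solution, not about the architecture lacking the capacity.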

Comments:

I haven't used PyTorch before, but one thing that jumps out at me is the architecture of your MLP. You're using linear activations in your hidden layers. The XOR problem can't be solved linearly, though. You could try switching your hidden layers to ReLU, Sigmoid, or one of the other non-linear activations. – Scratch'N'Purr

Also, because your problem is a classification problem, I would change the loss function to cross-entropy or negative log-likelihood. – Scratch'N'Purr

Is there a "pre-supposed" sigmoid between the nn.Linear layers? – alvas

Again, I hesitate to make a formal answer since I haven't used PyTorch, but I think you want to build the architecture as model = nn.Sequential(nn.Sigmoid(), nn.Sigmoid(), nn.Sigmoid()), with a criterion = nn.CrossEntropyLoss() or criterion = nn.NLLLoss(). Having Linear layers won't do anything to help the model, since you're just applying a weight to a linear line, which subsequently alters the weights downstream in the Sigmoid layer. – Scratch'N'Purr

I think Linear here refers to fully connected layers; the idea is to learn hidden dimensions that can separate the exclusive (1/0 or 0/1 -> 1) gate and then an OR gate through the hidden dimensions. See Figure 2 of pdfs.semanticscholar.org/51ec/… – alvas

4 Answers

5
votes

It's because nn.Linear has no activation built in, so your model is effectively a linear classifier, and XOR is the canonical example of a problem that can't be solved using linear classifiers.

Change this:

model = nn.Sequential(nn.Linear(input_dim, hidden_dim),
                      nn.Linear(hidden_dim, output_dim),
                      nn.Sigmoid())

to this:

model = nn.Sequential(nn.Linear(input_dim, hidden_dim),
                      nn.Sigmoid(),
                      nn.Linear(hidden_dim, output_dim),
                      nn.Sigmoid())

and only then will your model be equivalent to the one from the linked Kaggle notebook.
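
A minimal end-to-end run of that corrected architecture could look like the following sketch (it reuses the X_pt and Y_pt tensors from the question, zeroes the gradients each step, which the question's loop omits, and swaps in MSE with a larger learning rate; none of that is prescribed by this answer):

model = nn.Sequential(nn.Linear(2, 5),
                      nn.Sigmoid(),
                      nn.Linear(5, 1),
                      nn.Sigmoid())
criterion = nn.MSELoss()
optimizer = optim.SGD(model.parameters(), lr=1.0)

for _ in range(10000):
    optimizer.zero_grad()               # clear the gradients accumulated by backward()
    loss = criterion(model(X_pt), Y_pt)
    loss.backward()
    optimizer.step()

print([int(p > 0.5) for p in model(X_pt)])  # usually [0, 1, 1, 0]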

0
votes

You are almost there with your 2nd update. Here's a notebook with a working solution: https://colab.research.google.com/github/osipov/edu/blob/master/misc/xor.ipynb

Your mistake is using a sigmoid after the last linear layer, which makes it difficult for the optimizer to converge to the exact 0 and 1 values expected in your training dataset. Recall that the sigmoid only approaches 0 and 1 at negative and positive infinity, respectively.

So, your implementation (assuming PyTorch 1.7) should be

import torch as pt
from torch.nn.functional import mse_loss
pt.manual_seed(33);

model = pt.nn.Sequential(
    pt.nn.Linear(2, 5),
    pt.nn.ReLU(),
    pt.nn.Linear(5, 1)
)

X = pt.tensor([[0, 0],
               [0, 1],
               [1, 0],
               [1, 1]], dtype=pt.float32)

y = pt.tensor([0, 1, 1, 0], dtype=pt.float32).reshape(X.shape[0], 1)

EPOCHS = 100

optimizer = pt.optim.Adam(model.parameters(), lr = 0.03)

for epoch in range(EPOCHS):
  #forward
  y_est = model(X)
  
  #compute mean squared error loss
  loss = mse_loss(y_est, y)

  #backprop the loss gradients
  loss.backward()

  #update the model weights using the gradients
  optimizer.step()

  #empty the gradients for the next iteration
  optimizer.zero_grad()

which after execution trains the model, so that

model(X).round().abs()

returns

tensor([[0.],
        [1.],
        [1.],
        [0.]], grad_fn=<AbsBackward>)

which is the correct output.
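
If you also want probabilistic outputs, one common variation (not something this answer requires) is to keep the last layer linear and train against binary cross-entropy on the logits, which folds the sigmoid into the loss in a numerically stable way:

from torch.nn.functional import binary_cross_entropy_with_logits

# Same model, X, y, optimizer and EPOCHS as above, but with a BCE-with-logits loss.
for epoch in range(EPOCHS):
  y_est = model(X)
  loss = binary_cross_entropy_with_logits(y_est, y)
  loss.backward()
  optimizer.step()
  optimizer.zero_grad()

pt.sigmoid(model(X)).round()  # hard 0/1 predictions from the probabilities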

-1
votes

Here are a few simple changes to your code that should help put you on a better path. I've used ReLU activation functions internally, but sigmoid will also work if used correctly. Also, if you want to try the SGD optimizer, you may want to turn the learning rate down by an order of magnitude or so.

model = nn.Sequential(nn.Linear(input_dim, hidden_dim),    
                      nn.ReLU(),       
                      nn.Linear(hidden_dim, output_dim),
                      nn.Sigmoid())
if use_cuda:
  model.cuda()

criterion = nn.BCELoss()
#criterion = nn.L1Loss()
#learning_rate = 0.03
#optimizer = optim.SGD(model.parameters(), lr=learning_rate)
optimizer = optim.Adam(model.parameters())
num_epochs = 10000


for epoch in range(num_epochs):
    predictions = model(X_pt)
    loss_this_epoch = criterion(predictions, Y_pt)
    model.zero_grad()
    loss_this_epoch.backward()
    optimizer.step()
    if epoch%1000 == 0: 
      print([float(_pred) for _pred in predictions], list(map(int, Y_pt)), loss_this_epoch.data[0])
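
After training, a quick check in the same spirit as the question's evaluation loop, thresholding the sigmoid outputs at 0.5, might be:

for _x, _y in zip(X_pt, Y_pt):
    print(list(map(int, _x)), '->', int(model(_x) > 0.5), '(expected:', int(_y), ')')
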
-1
votes

With the sigmoid between the layers and at the end, the most important thing is to update the weights in a purely stochastic way, i.e., update after every single sample and pick a sample at random at every iteration.

When respecting this, and when using a large learning rate (around 1.0), I've observed that the model usually learns XOR fine with a standard 2-layer PyTorch implementation (2-2-1 layer sizes), with standard weight initialization and without regularization.
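
A sketch of that recipe as I read it (sigmoid between the layers and at the end, one randomly picked sample per update, learning rate 1.0; the remaining details are my own choices):

import random
import torch
from torch import nn, optim

X = torch.tensor([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
Y = torch.tensor([[0.], [1.], [1.], [0.]])

# 2-2-1 network with a sigmoid between the layers and at the end.
model = nn.Sequential(nn.Linear(2, 2), nn.Sigmoid(),
                      nn.Linear(2, 1), nn.Sigmoid())
criterion = nn.MSELoss()
optimizer = optim.SGD(model.parameters(), lr=1.0)

for _ in range(20000):
    i = random.randrange(4)          # pick a single sample at random
    optimizer.zero_grad()
    loss = criterion(model(X[i:i+1]), Y[i:i+1])
    loss.backward()                  # update after every single sample
    optimizer.step()

print(model(X).round())              # usually recovers [[0.], [1.], [1.], [0.]]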