
I'm trying to write a neural network for binary classification in PyTorch, and I'm confused about the loss function.

I see that BCELoss is a common function specifically geared for binary classification. I also see that an output layer of N outputs for N possible classes is standard for general classification. However, for binary classification it seems like it could be either 1 or 2 outputs.

So, should I have 2 outputs (1 for each label) and then convert my 0/1 training labels into [1,0] and [0,1] arrays, or use something like a sigmoid for a single-variable output?

Here are the relevant snippets of code:

self.outputs = nn.Linear(NETWORK_WIDTH, 2) # 1 or 2 dimensions?


def forward(self, x):
  # other layers omitted
  x = self.outputs(x)           
  return F.log_softmax(x)  # <<< softmax over multiple vars, sigmoid over one, or other?

criterion = nn.BCELoss() # <<< Is this the right function?

net_out = net(data)
loss = criterion(net_out, target) # <<< Should target be an integer label or 1-hot vector?

Thanks in advance.


2 Answers


For binary classification you can use a single output unit:

self.outputs = nn.Linear(NETWORK_WIDTH, 1)

Then apply a sigmoid activation to map the output to the range (0, 1) (your training labels need to be 0/1 values to match):

def forward(self, x):
    # other layers omitted
    x = self.outputs(x)           
    return torch.sigmoid(x)  

Finally, you can use torch.nn.BCELoss:

criterion = nn.BCELoss()

net_out = net(data)
loss = criterion(net_out, target)

This should work fine for you.
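For completeness, here is a minimal runnable version of the above with made-up sizes and random data (NETWORK_WIDTH = 16, the 4 input features, and the batch of 8 are all hypothetical placeholders):

```python
import torch
import torch.nn as nn

NETWORK_WIDTH = 16  # hypothetical width

class Net(nn.Module):
    def __init__(self):
        super().__init__()
        self.hidden = nn.Linear(4, NETWORK_WIDTH)   # 4 input features (made up)
        self.outputs = nn.Linear(NETWORK_WIDTH, 1)  # single output unit

    def forward(self, x):
        x = torch.relu(self.hidden(x))
        return torch.sigmoid(self.outputs(x))       # probability in (0, 1)

net = Net()
criterion = nn.BCELoss()

data = torch.randn(8, 4)                      # batch of 8 random samples
target = torch.randint(0, 2, (8, 1)).float()  # 0/1 labels as floats, same shape as output

net_out = net(data)                           # shape (8, 1)
loss = criterion(net_out, target)
```

Note that BCELoss expects the target to be a float tensor with the same shape as the output, not integer class indices.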

You can also use torch.nn.BCEWithLogitsLoss. This loss function already includes the sigmoid, so you can leave it out of your forward.
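As a sketch of that variant (layer size and data are made up): the model returns raw scores, and BCEWithLogitsLoss applies the sigmoid internally, which is also more numerically stable than a separate sigmoid + BCELoss:

```python
import torch
import torch.nn as nn

layer = nn.Linear(16, 1)             # final layer: one raw score (logit) per sample
criterion = nn.BCEWithLogitsLoss()   # sigmoid is applied inside the loss

x = torch.randn(8, 16)
target = torch.randint(0, 2, (8, 1)).float()

logits = layer(x)                    # raw scores; no sigmoid in the forward pass
loss = criterion(logits, target)

# Equivalent (up to floating point) to applying sigmoid yourself and using BCELoss:
loss_manual = nn.BCELoss()(torch.sigmoid(logits), target)
```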

If you want to use 2 output units, that is also possible. But then you need torch.nn.CrossEntropyLoss instead of BCELoss; the softmax activation is already included in that loss function, and the targets should be integer class labels rather than one-hot vectors.
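A minimal sketch of the 2-output alternative (input width and batch size are made up). Note the target format here: plain integer class indices of shape (N,), so no one-hot conversion is needed:

```python
import torch
import torch.nn as nn

layer = nn.Linear(16, 2)            # one score per class (made-up input width)
criterion = nn.CrossEntropyLoss()   # log-softmax + NLL in one step

x = torch.randn(8, 16)
target = torch.randint(0, 2, (8,))  # integer class indices 0 or 1, NOT one-hot

logits = layer(x)                   # shape (8, 2), raw scores; no softmax in forward
loss = criterion(logits, target)
```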


Edit: I just want to emphasize that there is a real difference between these choices. Using 2 output units gives you twice as many weights in the final layer compared to using 1 output unit, so the two alternatives are not equivalent.


Some theoretical background:

For binary classification (say class 0 and class 1), the network should have only 1 output unit. Its output is interpreted as the probability of class 1: a target of 1 means class 1 is present (class 0 absent), and a target of 0 means class 0 is present (class 1 absent).

For the loss calculation, first pass the output through a sigmoid and then through binary cross-entropy (BCE). The sigmoid transforms the network's output into a probability (between 0 and 1), and minimizing BCE then maximizes the likelihood of the desired output.
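The sigmoid-then-BCE computation is short enough to spell out by hand; here is a plain-Python sketch with a hypothetical raw output value of 2.0:

```python
import math

def sigmoid(z):
    # Map a raw score to a probability in (0, 1).
    return 1.0 / (1.0 + math.exp(-z))

def bce(p, y):
    # Binary cross-entropy for one sample: -[y*log(p) + (1-y)*log(1-p)].
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

z = 2.0                   # hypothetical raw network output
p = sigmoid(z)            # ~0.88, treated as P(class 1)
loss_correct = bce(p, 1)  # small loss: confident and right
loss_wrong = bce(p, 0)    # large loss: confident and wrong
```

Confident correct predictions give a loss near 0, while confident wrong predictions are penalized heavily, which is exactly what drives the likelihood maximization described above.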