I think it's important to understand softmax and cross-entropy, at least from a practical point of view. Once you have a grasp of these two concepts, it should be clear how they are "correctly" used in the context of ML.
Cross-Entropy H(p, q)
Cross-entropy is a function that compares two probability distributions. From a practical standpoint it's probably not worth getting into the formal motivation of cross-entropy, though if you're interested I would recommend Elements of Information Theory by Cover and Thomas as an introductory text. This concept is introduced pretty early on (chapter 2 I believe). This is the intro text I used in grad school and I thought it did a very good job (granted I had a wonderful instructor as well).
The key thing to pay attention to is that cross-entropy is a function that takes two probability distributions, q and p, as input and returns a value that (for a fixed p) is minimal exactly when q equals p; concretely, H(p, q) = -sum_x p(x) log q(x). Here q represents an estimated distribution and p represents a true distribution.
In the context of ML classification we know the actual label of the training data, so the true/target distribution, p, has a probability of 1 for the true label and 0 elsewhere, i.e. p is a one-hot vector.
On the other hand, the estimated distribution (the output of a model), q, generally contains some uncertainty, so the probability of any class in q will be between 0 and 1. By training a system to minimize cross-entropy we are telling it to make the estimated distribution as close as it can to the true distribution. Therefore, the class that your model thinks is most likely is the class corresponding to the highest value of q.
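As a concrete sketch (with made-up numbers), here is the cross-entropy between a one-hot target p and an estimated distribution q, computed in PyTorch:

import torch

# hypothetical true distribution p: class 2 is the correct label (one-hot)
p = torch.tensor([0.0, 0.0, 1.0])
# hypothetical estimated distribution q from a model (non-negative, sums to 1)
q = torch.tensor([0.1, 0.2, 0.7])

# cross-entropy H(p, q) = -sum_i p[i] * log(q[i])
H = -(p * torch.log(q)).sum()
print(H)  # ~0.357; it shrinks toward 0 as q[2] gets closer to 1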
Softmax
Again, there are some complicated statistical ways to interpret softmax that we won't discuss here. The key thing from a practical standpoint is that softmax is a function that takes a list of unbounded values as input, and outputs a valid probability mass function with the relative ordering maintained. It's important to stress the second point about relative ordering. This implies that the maximum element in the input to softmax corresponds to the maximum element in the output of softmax.
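As a quick sketch with arbitrary numbers (softmax(x)_i = exp(x_i) / sum_j exp(x_j)):

import torch

# arbitrary unbounded scores ("logits")
x = torch.tensor([2.0, -1.0, 0.5])

q = torch.softmax(x, dim=0)
print(q)                         # approximately [0.79, 0.04, 0.18]: a valid PMF
print(q.sum())                   # 1.0
print(x.argmax() == q.argmax())  # True: softmax preserves the relative ordering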
Consider a softmax-activated model trained to minimize cross-entropy. In this case, prior to the softmax, the model's goal is to produce the highest value possible for the correct label and the lowest values possible for the incorrect labels.
CrossEntropyLoss in PyTorch
The definition of CrossEntropyLoss in PyTorch is a combination of softmax and cross-entropy. Specifically:
CrossEntropyLoss(x, y) := H(one_hot(y), softmax(x))
Note that one_hot is a function that takes an index y and expands it into a one-hot vector.
Equivalently, you can formulate CrossEntropyLoss as a combination of LogSoftmax and negative log-likelihood loss (i.e. NLLLoss in PyTorch):
LogSoftmax(x) := ln(softmax(x))
CrossEntropyLoss(x, y) := NLLLoss(LogSoftmax(x), y)
Due to the exponentiation in softmax, there are some computational "tricks" that make using CrossEntropyLoss directly more stable (more accurate, less likely to produce NaN) than computing it in stages.
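Here is a small sketch (with random logits) showing that the formulations agree, and that the fused version is what you want in practice:

import torch
import torch.nn.functional as F

torch.manual_seed(0)
x = torch.randn(4, 10)          # batch of 4 examples, 10 classes (raw logits)
y = torch.tensor([1, 0, 3, 9])  # integer class labels

loss_fused = F.cross_entropy(x, y)                          # softmax + cross-entropy in one stable op
loss_staged = F.nll_loss(F.log_softmax(x, dim=1), y)        # LogSoftmax followed by NLLLoss
loss_naive = F.nll_loss(torch.log(F.softmax(x, dim=1)), y)  # naive log(softmax); can under/overflow for large logits

print(loss_fused, loss_staged, loss_naive)  # all three match on these well-behaved inputs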
Conclusion
Based on the above discussion, the answers to your questions are:
1. How to train a "standard" classification network in the best way?
Like the docs say: use a final linear layer and train with nn.CrossEntropyLoss on its raw outputs (no softmax in the model). For example:
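A minimal sketch of that recipe (model, data, and y are placeholders, and nn is torch.nn, mirroring the snippets below):

criterion = nn.CrossEntropyLoss()
...
logits = model(data)         # final layer is linear, NOT softmax activated
loss = criterion(logits, y)  # y contains integer class indices
loss.backward()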
2. If the network has a final linear layer, how to infer the probabilities per class?
Apply softmax to the output of the network to infer the probabilities per class. If the goal is only to find the relative ordering or the highest-probability class, then apply argsort or argmax to the output directly (since softmax maintains relative ordering). For example:
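For example (logits, model, and data are placeholders):

logits = model(data)                  # raw outputs of the final linear layer
probs = torch.softmax(logits, dim=1)  # per-class probabilities
pred = logits.argmax(dim=1)           # same prediction as probs.argmax(dim=1)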
3. If the network has a final softmax layer, how to train the network (which loss, and how)?
Generally, you don't want to train a network with a final softmax layer, for the stability reasons mentioned above.
That said, if you absolutely needed to for some reason, you would take the log of the outputs and provide them to NLLLoss:
criterion = nn.NLLLoss()
...
x = model(data) # assuming the output of the model is softmax activated
loss = criterion(torch.log(x), y)
which is mathematically equivalent to using CrossEntropyLoss with a model that does not use softmax activation:
criterion = nn.CrossEntropyLoss()
...
x = model(data) # assuming the output of the model is NOT softmax activated
loss = criterion(x, y)