2
votes

I have started recently with ML and TensorFlow. While going through the CIFAR10-tutorial on the website I came across a paragraph which is a bit confusing to me:

The usual method for training a network to perform N-way classification is multinomial logistic regression, aka. softmax regression. Softmax regression applies a softmax nonlinearity to the output of the network and calculates the cross-entropy between the normalized predictions and a 1-hot encoding of the label. For regularization, we also apply the usual weight decay losses to all learned variables. The objective function for the model is the sum of the cross entropy loss and all these weight decay terms, as returned by the loss() function.

I have read a few answers on this forum about what weight decay is, and I understand that it is used for regularization, so that the learned weights give a low loss and higher accuracy.

Now, from the text above, I understand that loss() is made up of the cross-entropy loss (which measures the difference between the predictions and the correct labels) and the weight decay loss.

I am clear on the cross-entropy loss, but what is this weight decay loss, and why not just call it weight decay? How is this loss calculated?


3 Answers

7
votes

Weight decay is nothing but L2 regularisation of the weights, which can be achieved using tf.nn.l2_loss.

The loss function with regularisation is given by:

[image: loss function with an added L2 regularisation term on the weights theta]

The second term of the above equation defines the L2 regularisation of the weights (theta). It is generally added to avoid overfitting. It penalises peaky weights and encourages the network to take all of its inputs into account. (A few peaky weights would mean that only the inputs connected to them are considered for decision making.)

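During the gradient descent parameter update, this L2 regularisation means that every weight is shrunk by a constant factor at each step. With learning rate alpha and a penalty of (lambda/2) * ||W||^2, the update is W_new = (1 - alpha*lambda) * W_old - alpha * dJ/dW, where J is the unregularised loss. That's why it is generally called weight decay.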
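
As a rough check of why the penalty shows up as a decay, the sketch below takes one plain gradient step on J(w) + (lam / 2) * sum(w^2) and compares it with the "shrink, then step" form. The function names, the learning rate `alpha`, the strength `lam`, and the gradient values are all made up for illustration; the factor of 1/2 in the penalty is an assumption that matches tf.nn.l2_loss.

```python
# Sketch: one gradient-descent step on J(w) + (lam / 2) * sum(w**2).
# alpha, lam and the gradient values are hypothetical illustration numbers.

def step_regularized(w, grad_j, alpha, lam):
    """Plain gradient step on the regularized loss; the penalty's gradient is lam * w."""
    return [wi - alpha * (gi + lam * wi) for wi, gi in zip(w, grad_j)]

def step_decay_form(w, grad_j, alpha, lam):
    """The same step rewritten: shrink each weight, then apply the data gradient."""
    return [(1 - alpha * lam) * wi - alpha * gi for wi, gi in zip(w, grad_j)]

w = [2.0, -1.0, 0.5]
grad_j = [0.3, -0.2, 0.0]  # hypothetical dJ/dw of the unregularized loss

a = step_regularized(w, grad_j, alpha=0.1, lam=0.5)
b = step_decay_form(w, grad_j, alpha=0.1, lam=0.5)
print(a)
print(b)
```

Both lists should agree up to floating-point rounding. The third weight has a zero data gradient, so it changes only through the (1 - alpha * lam) shrink factor.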

0
votes

It is called weight decay loss because it is added to the cost function (the loss, to be specific). Parameters are optimized from the loss, so putting weight decay into the loss makes its effect visible to the entire network. See TF L2 loss (tf.nn.l2_loss):

Cost = Model_Loss(W) + decay_factor*L2_loss(W)
# In TensorFlow, tf.nn.l2_loss computes half the squared L2 norm (no square root)
L2_loss = sum(W ** 2) / 2
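
The pseudocode above can be tried out without TensorFlow. This is only a minimal pure-Python sketch: `l2_loss` mirrors the documented formula of tf.nn.l2_loss, while `total_cost`, the model loss of 0.8, and the decay factor of 0.01 are hypothetical values chosen for illustration.

```python
def l2_loss(weights):
    """Half the sum of squared weights, i.e. sum(w ** 2) / 2."""
    return sum(w * w for w in weights) / 2

def total_cost(model_loss, weights, decay_factor):
    """Model loss (e.g. cross-entropy) plus the scaled weight decay term."""
    return model_loss + decay_factor * l2_loss(weights)

weights = [1.0, -2.0, 3.0]   # hypothetical flattened weights
print(l2_loss(weights))      # (1 + 4 + 9) / 2 = 7.0
print(total_cost(0.8, weights, decay_factor=0.01))
```

With these numbers the decay term adds about 0.07 to the model loss, so larger weights make the total cost larger even when the predictions stay the same.
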
0
votes

What your tutorial means by "weight decay loss" is this: compared to the cross-entropy cost you know from your unregularized models (i.e. how far off target your model's predictions were on the training data), your new cost function penalizes not only prediction error but also the magnitude of the weights in your network. Whereas before you were optimizing only for correct prediction of the labels in your training set, now you are optimizing for correct label prediction as well as for small weights.

The reason for this modification is that when a machine learning model trained by gradient descent yields large weights, it is likely they were arrived at in response to peculiarities (or noise) in the training data. Such a model will not perform as well when exposed to held-out test data because it is overfit to the training set. The result of applying weight decay loss, more commonly called L2 regularization, is that accuracy on training data will typically drop a bit, but accuracy on test data can improve, sometimes considerably. And that's what you're after in the end: a model that generalizes well to data it did not see during training.

So that you can get a firmer grasp on the mechanics of weight decay, let's look at the learning rule for weights in an L2-regularized network:

[image: learning rule] w -> w - eta * dC/dw - (eta * lambda / n) * w

where eta and lambda are the user-defined learning rate and regularization parameter, respectively, and n is the number of training examples (you'll have to look up those Greek letters if you're not familiar). Since the values eta and (eta*lambda)/n are both constants for a given iteration of training, it's enough to interpret the learning rule for weight decay as "for a given weight, subtract a small multiple of the derivative of the cost function with respect to that weight, and subtract a small multiple of the weight itself."
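
To see the "decay" in isolation, here is a small sketch of that rule as described in words: subtract eta times the cost derivative, and subtract (eta*lambda)/n times the weight itself. The function name `update` and the values of `eta`, `lam` and `n` are hypothetical. The cost derivative is deliberately set to zero so that only the regularization term acts.

```python
def update(w, dcost_dw, eta, lam, n):
    """One step of the rule: gradient term plus a pull proportional to w."""
    return w - eta * dcost_dw - (eta * lam / n) * w

eta, lam, n = 0.5, 5.0, 100   # hypothetical learning rate, strength, dataset size
w = 4.0
history = [w]
for _ in range(3):
    # With a zero cost derivative, each step multiplies w by (1 - eta * lam / n).
    w = update(w, dcost_dw=0.0, eta=eta, lam=lam, n=n)
    history.append(w)
print(history)
```

Here the shrink factor is 1 - 0.5 * 5 / 100 = 0.975, so the weight decays geometrically toward zero unless the cost derivative pushes back.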

Let's look at four weights in an imaginary network and how the above learning rule affects them. As you can see, the regularization term shown in red pushes weights toward zero no matter what. It is designed to minimize the magnitude of the weight matrix, which it does by minimizing the absolute values of individual weights. Some key things to notice in these plots:

  1. When the sign of the cost derivative and the sign of the weight are the same, the regularization term accelerates the weight's path to its optimum!
  2. The amount that the regularization term affects the weight update is proportional to the current value of that weight. I've shown this in the plots with tiny red arrows showing contributions of weights with current values close to zero, and larger red arrows for weights with larger current magnitudes.
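
Both points can be checked numerically. The sketch below splits one update into its gradient part and its regularization part for a few made-up (weight, cost derivative) pairs; `update_terms` and all the numbers are hypothetical and only meant to mirror the two observations above.

```python
def update_terms(w, dcost_dw, eta, lam, n):
    """Return (gradient part, regularization part) of one weight update."""
    return -eta * dcost_dw, -(eta * lam / n) * w

eta, lam, n = 0.5, 5.0, 100   # hypothetical hyperparameters
cases = [(4.0, 1.0), (4.0, -1.0), (0.4, 1.0), (-4.0, 1.0)]  # (weight, cost derivative)

results = []
for w, d in cases:
    grad_part, reg_part = update_terms(w, d, eta, lam, n)
    # The two parts point the same way exactly when w and d share a sign.
    same_direction = grad_part * reg_part > 0
    results.append((grad_part, reg_part, same_direction))
    print(w, d, grad_part, reg_part, same_direction)
```

In the first and third cases the weight and the derivative share a sign, so both parts move the weight in the same direction. The regularization part for w = 4.0 is ten times that for w = 0.4, matching the proportionality in point 2.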

[image: plots of four example weights, with the regularization term's contribution drawn as red arrows]