5
votes

I am currently implementing a custom loss layer, and in the process I stumbled upon the implementation of mean squared error in the objectives.py file [1]. I know I'm missing something in my understanding of this loss calculation, because I always thought that the average was done separately across the samples for each output in each mini-batch (axis 0 of the tensor), but it appears that the average is actually being done across the last axis, which for a single vector would mean it's being done across the outputs. I found this by accident while working on my custom loss layer, because it requires discounting the loss of a few of the outputs if a training output in a specific place has a specific value. Anyway, is my understanding of the mean squared error incorrect? Why would Keras be using the last axis, thus turning a 1xn output vector into a 1x1 output vector?

Thanks.

[1] https://github.com/fchollet/keras/blob/master/keras/objectives.py#L7

3
What do you think K.mean means? :) – Dr. Snoopy
Sorry, I adjusted my question. I meant that I didn't see where the squaring was happening, not the mean. – Corey J. Nolet
That would be K.square. – Dr. Snoopy
Did you read my whole question? – Corey J. Nolet
Yes, but in any case there are multiple questions here; I was just pointing out one. – Dr. Snoopy

3 Answers

9
votes

The code in question for the MSE loss is this:

def mean_squared_error(y_true, y_pred):
    return K.mean(K.square(y_pred - y_true), axis=-1)

Here y_pred and y_true are first subtracted, then that result is passed to K.square, which, as expected, returns the square of its parameter, and finally that result is given to K.mean, which computes the mean along the given axis.

So the code is clearly doing what it's supposed to do. As for why the last axis is operated upon: this has nothing to do with classes, it is just a convention. Note that in general there are no classes in the MSE definition.
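To make the reduction concrete, here is a minimal NumPy sketch (illustrative only, not Keras' actual backend code) showing that axis=-1 produces one loss value per sample rather than one per output:

    import numpy as np

    # A hypothetical mini-batch of 2 samples with 3 outputs each
    y_true = np.array([[1.0, 2.0, 3.0],
                       [4.0, 5.0, 6.0]])
    y_pred = np.array([[1.5, 2.0, 2.0],
                       [4.0, 7.0, 6.0]])

    # The same reduction as mean_squared_error above: average over the last axis
    per_sample_loss = np.mean(np.square(y_pred - y_true), axis=-1)
    print(per_sample_loss)  # [0.41666667 1.33333333], shape (2,): one value per sample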

3
votes

Let's detail the steps of how the losses are computed in Keras, to show that the axis=-1 in all the loss computations is correct:

  • First, we pick a loss in losses.py that we will pass to the compile method of our model.

  • In compile, the total loss is computed. It happens in several steps: the first step creates a list of losses, one for each output of the model.

  • This first step calls _weighted_masked_objective, which according to the docs 'Adds support for masking and sample-weighting to an objective function'.
  • Basically, _weighted_masked_objective returns a new objective function which takes into account the weights and mask parameters that the user will provide when using the method fit.

If I cut the code down to only the lines that matter for the question, we get something like this:

def _weighted_masked_objective(fn):
    def weighted(y_true, y_pred, weights, mask=None):
        score_array = fn(y_true, y_pred)  # Compute the per-sample loss as in losses.py (reduces axis=-1)
        return K.mean(score_array)        # Average over all remaining axes
    return weighted

class Model(Container):
    def compile(self, optimizer, loss, metrics=None, loss_weights=None,
                sample_weight_mode=None, weighted_metrics=None,
                target_tensors=None, **kwargs):
        weighted_losses = [_weighted_masked_objective(fn) for fn in loss_functions]

So at the end, the loss is indeed averaged over every dimension, and the use of axis=-1 is just an elegant way to enable masking and weighting of the loss at another point in the code.
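As a sanity check, here is a small NumPy sketch of this two-stage reduction (hypothetical shapes, not the actual Keras source): the loss function reduces only the last axis, and the wrapper then averages whatever axes remain:

    import numpy as np

    def mse_last_axis(y_true, y_pred):
        # Stage 1: the loss in losses.py reduces only the last axis
        return np.mean(np.square(y_pred - y_true), axis=-1)

    # A hypothetical batch of 32 samples with 10 outputs each
    y_true = np.random.rand(32, 10)
    y_pred = np.random.rand(32, 10)

    score_array = mse_last_axis(y_true, y_pred)  # shape (32,): one loss per sample
    total_loss = np.mean(score_array)            # Stage 2: the wrapper averages the rest
    print(score_array.shape, total_loss)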

NB: I didn't explain the other steps because they don't contribute to answering the question.

2
votes

I believe, after some conversations with coworkers, that I understand this situation and have a proper solution to the problem. Although I knew that Theano provides lazily-evaluated tensor functions that run the matrix operations on the GPU, what I did not realize was that Keras's loss functions are actually written in a way where the compiled Theano execution graph is smart enough to cache certain values in order to properly back-propagate the loss values throughout the network. Because of the type of network I'm creating, I dove into writing my own custom loss function without completely understanding how Theano actually treats the loss after it's been calculated by the function.

From what I can tell, my concern was correct: Keras' use of the last axis is a problem. In my case, I have a fully-convolutional deep neural network and the input to the loss function is (x, 7, 16, 16), where x is the size of the mini-batch. Normally, neural networks output a matrix where the first dimension is the mini-batch size and the second (usually last) dimension is the actual size of the output vector. Because of this, using the last axis in the output tensor to do the actual "mean" portion of the mean squared error is not correct. Instead, the axis should be 1 (with zero-based indexing), because it's the 7 actual regression output features that need to be differentiated for back-propagation.
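For anyone in a similar situation, here is a minimal sketch of such a custom loss (my own adaptation, not code provided by Keras; the (x, 7, 16, 16) shape is specific to my network):

    from keras import backend as K

    def mse_over_features(y_true, y_pred):
        # For outputs shaped (batch, 7, 16, 16), average the squared error over
        # axis 1 (the 7 regression features) instead of the default last axis
        return K.mean(K.square(y_pred - y_true), axis=1)

    # Example usage: model.compile(optimizer='adam', loss=mse_over_features)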

I originally suspected that axis=-1 might not be correct, and the reason I posted this question was because I couldn't quite explain why. It's been a long time since I've had to dive into the math behind neural networks, but when I finally did, I was able to resolve the gaps (I think). I'm posting this response here for future people who may experience this same problem or gap in their understanding of Theano's tensor framework.