0 votes

Training with a softmax output layer for my generative neural network gives better results overall than with relu, but relu gives me the sparsity I need (zeros in pixels). Softmax also helps get a normalised output (i.e. the values sum to 1).

I want to do:

outputs = Dense(200, activation='softmax', activity_regularizer=l1(1e-5))(x)
outputs = Activation('relu')(outputs) # to get real zeros
outputs = Activation('softmax')(outputs) # still real zeros, normalized output

But by applying successive softmaxes I will get extreme outputs. Is there a layer I can use instead which just normalizes the output to sum to 1 (output_i / sum(output)) instead of a softmax?


2 Answers

2 votes

You don't need to add two softmaxes. Just the last one is fine:

outputs = Dense(200, activation='relu', activity_regularizer=l1(1e-5))(x)
outputs = Activation('softmax')(outputs) # normalized output

That said, if you have more intermediate layers and you want them to behave more moderately, you could use a tanh instead of a softmax.

Often the problem with relu models is not exactly that "they don't sum to 1", but simply that "their values are way too high, so gradients can't behave well".

# this caps the outputs at 1 (but doesn't care about the sum)
# while keeping the sparsity:
outputs = Dense(200, activation='tanh')(x)
outputs = Activation('relu')(outputs) # to get real zeros

outputs = Dense(200, activation='relu')(outputs)

# this should only be used at the final layer,
# and only if you really have a classification model with only one correct class
outputs = Activation('softmax')(outputs) # normalized output

Softmax tends to favor only one of the results. If you don't want to change how the results compare to one another, and yet you want them to sum to 1, you can go for @nuric's answer.
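
For intuition, here is a quick standalone comparison of the two options on a small vector (just an illustrative NumPy sketch, not part of the model):

import numpy as np

x = np.array([0.0, 0.0, 2.0, 1.0])  # a sparse, relu-style output

softmax = np.exp(x) / np.exp(x).sum()  # ~[0.08, 0.08, 0.61, 0.22]: mass concentrates on the largest value, zeros become non-zero
sum_norm = x / x.sum()                 # [0.0, 0.0, 0.667, 0.333]: relative proportions and zeros are preserved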

1 vote

You can write your own layer to convert the outputs to unit norm (i.e. normalise them in your case) without applying a softmax. You can achieve this by converting the output to a unit vector. Something along the lines of:

from keras import backend as K
from keras.layers import Lambda

def unitnorm(x):
    # normalize each sample's output vector to unit L2 norm
    return x / (K.epsilon() + K.sqrt(K.sum(K.square(x), axis=-1, keepdims=True)))

# wrap it in a Lambda layer
outputs = Lambda(unitnorm, name='unitnorm')(outputs)

The code is adapted from the unit norm constraint, which does the same for kernels and biases in layers. You can try it without the epsilon to be more precise, but it could be less stable when you have a lot of zeros.
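
Note this gives a unit L2 norm rather than output_i / sum(output). If you literally want the values to sum to 1, the same Lambda pattern works with the sum in the denominator (a minimal sketch, assuming the outputs are non-negative, e.g. after a relu; sum_to_one is just an illustrative name):

def sum_to_one(x):
    # divide each sample's outputs by their sum so they add up to 1
    return x / (K.epsilon() + K.sum(x, axis=-1, keepdims=True))

outputs = Lambda(sum_to_one, name='sum_to_one')(outputs)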