
I'm trying to build an LSTM autoencoder on sequences of text (titles of web articles), mostly by reproducing the basic example from https://blog.keras.io/building-autoencoders-in-keras.html. The input is a sequence of 80 one-hot vectors (80 is the maximum title length), each of length 40 (the number of ASCII characters in the dataset). The output, against which the predictions are checked, is the same as the input, because it is an autoencoder. I have about 60k sequences for testing the model, but ultimately I would like to run it on the whole set of 320k.
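A minimal sketch of the kind of one-hot encoding I mean (the helper name and the char_to_index mapping are just illustrative, not my actual preprocessing code):

import numpy as np

# Illustrative helper: encode titles as a (num_titles, 80, 40) one-hot tensor,
# where char_to_index maps each of the 40 characters to a column index
def one_hot_titles(titles, char_to_index, max_title_length=80, number_of_chars=40):
    X = np.zeros((len(titles), max_title_length, number_of_chars), dtype=np.int32)
    for i, title in enumerate(titles):
        for t, ch in enumerate(title[:max_title_length]):
            X[i, t, char_to_index[ch]] = 1
    return X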

Problem

The problem is that the LSTM network isn't learning properly at all. For instance, the Czech sentence 'Real vyhrál slavné derby s Barcelonou' ('Real won the famous derby against Barcelona') gets reproduced as '###uuu...uuu' (the dots mean that the u's continue until the end).

The tutorial above doesn't mention which loss function, activation function or optimizer to use, so I searched and found that the RMSProp optimizer tends to work best with LSTMs. I tried ReLU, tanh, softmax, etc. as activation functions, though none of them did any better. What I am hesitating about most is the loss function. I thought that binary or categorical cross-entropy would work nicely, but this might be exactly where I am mistaken. Mean squared error didn't yield any good results either.

My model thus far

import tensorflow as tf
from keras.layers import Input, Lambda, LSTM, RepeatVector
from keras.models import Model

# Encoder: cast the one-hot int input to float and compress it into a single latent vector
input_sentence = Input(shape=(max_title_length, number_of_chars), dtype='int32')
tofloat = Lambda(function=lambda x: tf.to_float(x))(input_sentence)
encoder = LSTM(latent_dim, activation='tanh')(tofloat)

# Decoder: repeat the latent vector for every timestep and decode back to character scores
decoder = RepeatVector(max_title_length)(encoder)
decoder = LSTM(number_of_chars, return_sequences=True, activation='tanh')(decoder)
autoencoder = Model(inputs=input_sentence, outputs=decoder)

autoencoder.compile(loss='binary_crossentropy', optimizer='rmsprop', metrics=['accuracy'])
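Training then feeds the same tensor as both input and target, roughly like this (the epoch count and batch size are just placeholder values):

# X is the (num_titles, 80, 40) one-hot tensor; an autoencoder reconstructs its own input
autoencoder.fit(X, X, epochs=50, batch_size=128)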

So the questions

  1. What loss function would you use?
  2. Does the binary/categorical cross-entropy calculate loss for the entire output matrix or for individual timesteps (rows in the output matrix)? I would like to achieve the latter.
  3. If you think this approach isn't going to work, would you suggest a different one?
Comment (ZakJ): This question might be better answered at datascience.stackexchange.com.

1 Answer


I think a better loss function for your case is the Hamming loss, which computes the average Hamming loss (Hamming distance) between two sets of samples. That way you can compute the distance between all rows of the two matrices.

Example with scikit-learn and NumPy:

>>> import numpy as np
>>> from sklearn.metrics import hamming_loss
>>> hamming_loss(np.array([[0, 1], [1, 1]]), np.zeros((2, 2)))
0.75
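Note that scikit-learn's hamming_loss itself isn't differentiable, so you can't pass it straight to compile() as the loss. As a rough sketch, you could at least monitor the same quantity in Keras with a custom metric (the function name here is just illustrative):

from keras import backend as K

# Fraction of character positions where the thresholded prediction differs from the target
def hamming_distance(y_true, y_pred):
    return K.mean(K.cast(K.not_equal(y_true, K.round(y_pred)), K.floatx()))

# e.g. autoencoder.compile(loss='binary_crossentropy', optimizer='rmsprop',
#                          metrics=[hamming_distance])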