I am building an autoencoder model with LSTM layers in Keras for text outlier detection. I have encoded every sentence as a sequence of numbers, with each number representing a letter.
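For context, the encoding step looks roughly like this (char_to_int and sentences are illustrative stand-ins for my actual preprocessing):

# hypothetical mapping from characters to integers; 0 is reserved for padding
char_to_int = {c: i + 1 for i, c in enumerate("abcdefghijklmnopqrstuvwxyz ")}
sequences = [[char_to_int[c] for c in sentence.lower() if c in char_to_int]
             for sentence in sentences]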
So far I have trained a model with fixed-length input by zero-padding each of the 4000 sequences up to maxlength = 40, i.e. training the model on an array of shape [4000, 40, 1] ([batch_size, timesteps, features]).
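Concretely, that fixed-length setup was roughly this (variable names are illustrative):

import numpy as np
from keras.preprocessing.sequence import pad_sequences

maxlength = 40
# zero-pad every sequence to the same length (shown here as post-padding)
padded = pad_sequences(sequences, maxlen=maxlength, padding='post', value=0)
x_fixed = padded.reshape((len(padded), maxlength, 1)).astype('float32')  # [4000, 40, 1]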
Now I am wondering how I can use such an autoencoder model without zero-padding, i.e. training and predicting with the actual length of each sentence (sequence).
At the moment I have standardized every sequence, so my training data (x_train) is a list of arrays, where each array in the list is a standardized sequence of numbers with its own length.
To feed this data to the LSTM model I tried reshaping it into a 3D array with:
x_train = np.reshape(x_train, (len(x_train), 1, 1))
though I am not sure this is correct.
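As far as I understand, np.reshape cannot turn a list of different-length arrays into one rectangular array, so perhaps each sequence has to be reshaped on its own, e.g.:

# reshape each sequence separately into [1, timesteps, 1] (a batch of one)
x_train_3d = [seq.reshape((1, len(seq), 1)) for seq in x_train]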
My model looks like this (I've removed the input_shape parameter so the model can accept variable-length input):
from keras.models import Sequential
from keras.layers import LSTM, Dense

model = Sequential()
model.add(LSTM(20, activation='tanh', return_sequences=True))
model.add(LSTM(15, activation='tanh', return_sequences=True))
model.add(LSTM(5, activation='tanh', return_sequences=True))
model.add(LSTM(15, activation='tanh', return_sequences=True))
model.add(LSTM(20, activation='tanh', return_sequences=True))
model.add(Dense(1, activation='tanh'))
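If I read the Keras docs correctly, another option would be to keep input_shape but leave the timesteps dimension as None, so the model still accepts any sequence length:

# timesteps=None means the layer accepts sequences of any length
model.add(LSTM(20, activation='tanh', return_sequences=True, input_shape=(None, 1)))

with the remaining layers unchanged.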
Then, when trying to compile and train the model:
from keras.callbacks import ModelCheckpoint, EarlyStopping

nb_epoch = 10
model.compile(optimizer='rmsprop', loss='mse')
checkpointer = ModelCheckpoint(filepath="text_model.h5",
                               verbose=0,
                               save_best_only=True)
es_callback = EarlyStopping(monitor='val_loss')
history = model.fit(x_train, x_train,
                    epochs=nb_epoch,
                    shuffle=True,
                    validation_data=(x_test, x_test),
                    verbose=0,
                    callbacks=[checkpointer, es_callback])
I get the error: "ValueError: setting an array element with a sequence."
My model summary is the following:
_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
lstm_6 (LSTM)                (None, 1, 20)             1760
_________________________________________________________________
lstm_7 (LSTM)                (None, 1, 15)             2160
_________________________________________________________________
lstm_8 (LSTM)                (None, 1, 5)              420
_________________________________________________________________
lstm_9 (LSTM)                (None, 1, 15)             1260
_________________________________________________________________
lstm_10 (LSTM)               (None, 1, 20)             2880
_________________________________________________________________
dense_2 (Dense)              (None, 1, 1)              21
=================================================================
Total params: 8,501
Trainable params: 8,501
Non-trainable params: 0
_________________________________________________________________
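One workaround I am considering, to make the question concrete: since every batch passed to fit must be rectangular, I could feed one sequence per batch with train_on_batch (rough sketch, untested):

for epoch in range(nb_epoch):
    for seq in x_train:
        sample = seq.reshape((1, len(seq), 1))  # batch of one, actual length
        model.train_on_batch(sample, sample)    # autoencoder: target == input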
So my question is whether it is possible to train and predict with variable-length input sequences in an LSTM autoencoder model.
And whether my thinking on text outlier detection using such a model architecture is correct.