How on earth can I pass the sequences with different length to an LSTM on keras?

Question

I have an X_train set of 744983 samples divided into 24443 sequences, and the number of samples differs from sequence to sequence. Each sample is a 30-dimensional vector. How can I feed this data into a Keras LSTM? Here is a description of the training set:

print(type(X_train))
print(np.shape(X_train))
print(type(X_train[0]))
print(np.shape(X_train[0]))

<class 'list'>
(24443, )
<class 'numpy.ndarray'>
(46, 30)

When I set the parameters in this way :

model = Sequential()
model.add(LSTM(4, input_shape = (30, ), return_sequences=True,))
model.add(Dense(1))
model.compile(loss='sparse_categorical_crossentropy', optimizer='adam')
model.fit(X_train, y_train, epochs=1, batch_size=1, verbose=2)

The error is "Input 0 is incompatible with layer lstm_24: expected ndim=3, found ndim=2"

If I change input_shape from (30, ) to (None, 30), the code runs for 1 minute and then gives the error 'Error when checking model input: the list of Numpy arrays that you are passing to your model is not the size the model expected. Expected to see 1 arrays but instead got the following list of 24443 arrays'

Furthermore, if I change X_train into nparrays before fitting, the error turns to : expected lstm_26_input to have 3 dimensions, but got array with shape (24443, 1)

I also tried to pad the sequences :

X_train = sequence.pad_sequences(X_train)
X_test = sequence.pad_sequences(X_test)

However it turned my inputs to '0', '1', '-1' everywhere..

#X_train = np.array(X_train)
#X_test = np.array(X_test)
print(X_train[0])
[[ 0  0  0 ...,  0  0  0]
 [ 0  0  0 ...,  0  0  0]
 [ 0  0  0 ...,  0  0  0]
 ..., 
 [ 0  0  0 ...,  0  1 -1]
 [ 0  0  0 ...,  0  1  0]
 [ 0  0  0 ...,  0  0  0]]

What's X_train.shape ? I find the question a bit confusing, if the problem is with input shapes you should post the inputs shapes, not their content (from which we have no way of retrieving the shape) — gionni
Hi, I used np.split function to produce the 24443 sequences, the type of X_train is 'list', it has no attribute 'shape', but np.shape(X_train) = (24443, ). After I run 'X_train = np.array(X_train)', X_train.shape is (24443, ). After I pad the sequences, X_train.shape = (24443, 124, 30). — Jean-Philippe
What is X_train[0] before you pad the sequences ? If your X_train.shape is (24443, 124, 30), have you tried changing the input_shape parameter for the LSTM to (124, 30) ? — user2969402

Mikhail Stepanov · Accepted Answer · 2018-08-28T15:54:19

By default, sequence.pad_sequences casts your data into int32 dtype:

tf.keras.preprocessing.sequence.pad_sequences(
    sequences,
    maxlen=None,
    dtype='int32',  # problem is here
    padding='pre',
    truncating='pre',
    value=0.0
)

Try changing dtype to float32:

X_train = sequence.pad_sequences(X_train, dtype='float32')
X_test = sequence.pad_sequences(X_test, dtype='float32')
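To see why the dtype matters, here is a minimal NumPy-only sketch. The helper `pad_pre` and the toy data `X_toy` are hypothetical names made up for this illustration; the helper only imitates left ('pre') padding and is not the Keras implementation. It shows that writing float features into an int32 buffer truncates them toward zero, which is consistent with the 0, 1, -1 values in the question, while a float32 buffer keeps them.

```python
import numpy as np

def pad_pre(sequences, dtype="float32", value=0.0):
    """Left-pad arrays of shape (timesteps, features) to a common length.

    Hypothetical helper imitating padding='pre'; not the Keras implementation.
    """
    maxlen = max(len(s) for s in sequences)
    n_features = sequences[0].shape[1]
    out = np.full((len(sequences), maxlen, n_features), value, dtype=dtype)
    for i, seq in enumerate(sequences):
        # Assignment casts the values to `dtype`; for int32 this truncates.
        out[i, maxlen - len(seq):] = seq
    return out

rng = np.random.default_rng(0)
# Toy stand-in for X_train: three sequences of different lengths, 30 features each.
X_toy = [rng.normal(size=(n, 30)) for n in (3, 5, 2)]

as_int = pad_pre(X_toy, dtype="int32")      # fractional parts are lost
as_float = pad_pre(X_toy, dtype="float32")  # values are preserved

print(as_int.shape, as_int.dtype)
print(as_float.shape, as_float.dtype)
```

Once the padded array is float32 with shape (24443, 124, 30), the input_shape=(124, 30) suggested in the comments matches it. Keras also has a Masking layer that can tell the LSTM to skip all-zero timesteps; whether it helps here is something to verify on your own data.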
