
Please see the following code for creating an LSTM network:

from keras.models import Sequential
from keras.layers import LSTM, Dense, Dropout
from keras.optimizers import SGD
from keras.callbacks import EarlyStopping, ModelCheckpoint

NumberofClasses = 8

# 32 timesteps per sample, 512 VGG16 features per timestep
model = Sequential()
model.add(LSTM(256, dropout=0.2, input_shape=(32, 512), return_sequences=False))
model.add(Dense(1024, activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(NumberofClasses, activation='softmax'))
print(model.summary())

sgd = SGD(lr=0.00005, decay=1e-6, momentum=0.9, nesterov=True)
model.compile(optimizer=sgd, loss='categorical_crossentropy', metrics=['accuracy'])

callbacks = [EarlyStopping(monitor='val_loss', patience=10, verbose=1),
             ModelCheckpoint('video_1_LSTM_1_1024.h5', monitor='val_loss',
                             save_best_only=True, verbose=1)]
nb_epoch = 500
model.fit(train_data, train_labels,
          validation_data=(validation_data, validation_labels),
          batch_size=batch_size, epochs=nb_epoch,
          callbacks=callbacks, shuffle=False, verbose=1)

In the above code, I am creating an LSTM with the Keras library in Python. My data consists of 131 videos belonging to 8 different classes. I set a sequence length of 32 frames per video (so the 131 videos produced 4192 frames in total), and I extracted features for each of these frames from a pre-trained VGG16 model. I built the training dataset by stacking these extracted features into an array, which gave a final array of shape (4192, 512). The corresponding train_labels holds the one-hot encoding of the eight classes and has shape (4192, 8). However, since an LSTM expects input in the (samples, timesteps, features) format, and each video in my case is a sequence of 32 frames, I reshaped the training data to (131, 32, 512) and applied the analogous reshaping to train_labels. When I run this, I get the following error:

 ValueError: Error when checking target: expected dense_2 to have 2 dimensions, but got 
 array with shape (131, 32, 8)

If I do not reshape train_labels and leave it as (4192, 8), the error is:

 ValueError: Input arrays should have the same number of samples as target 
 arrays. Found 131 input samples and 4192 target samples.

Please note that because each of my videos is a sequence of 32 frames, I applied the reshaping (131, 32, 512) to the training data and (131, 32, 8) to the corresponding labels. I would appreciate any comment or advice on solving this problem.
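For reference, here is a minimal numpy sketch of the reshaping I described above (the placeholder arrays are illustrative; in my code they hold the real VGG16 features and one-hot labels, with frames ordered video by video):

import numpy as np

# 4192 per-frame VGG16 features and their one-hot labels
train_data = np.zeros((4192, 512))    # placeholder for the extracted features
train_labels = np.zeros((4192, 8))    # placeholder for the one-hot labels

# group frames back into videos: 131 videos x 32 frames
train_data = train_data.reshape(131, 32, 512)
train_labels = train_labels.reshape(131, 32, 8)   # this shape triggers the first error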


1 Answer


In video classification you usually have one label for the whole video, meaning that in your case the labels should have shape (131, 8).
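Since your (4192, 8) labels repeat the same one-hot vector for all 32 frames of a video, one simple way to get per-video labels is to group them by video and keep one label per video. A minimal sketch, assuming the frames are ordered video by video as in your setup:

# group per-frame labels by video and keep the first frame's label
video_labels = train_labels.reshape(131, 32, 8)[:, 0, :]   # shape (131, 8)

# now samples and targets match: 131 videos, 131 labels
model.fit(train_data, video_labels, batch_size=batch_size,
          epochs=nb_epoch, callbacks=callbacks, shuffle=False, verbose=1)

(Your validation labels would need the same treatment.)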

If your labels have shape (131, 32, 8), that means you have 131 samples, each with 32 timesteps and one 8-class label per timestep. In that case you are classifying every frame, which is not video classification. A model can be trained this way, but the LSTM needs a small change for it to work.

If you want to classify each timestep, you should use return_sequences=True in your LSTM, for example:

model = Sequential()
model.add(LSTM(256, dropout=0.2, input_shape=(32, 512), return_sequences=True))
model.add(Dense(1024, activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(NumberofClasses, activation='softmax'))

You can check how the output shape of the model changes with model.summary().
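With return_sequences=True the Dense layers are applied to every timestep, so the output shape becomes (None, 32, 8) and the model can be trained directly against per-frame targets. A minimal sketch, reusing the variables from your question:

# per-frame targets: one one-hot label for each of the 32 frames per video
frame_labels = train_labels.reshape(131, 32, 8)

model.compile(optimizer=sgd, loss='categorical_crossentropy', metrics=['accuracy'])
model.fit(train_data, frame_labels, batch_size=batch_size,
          epochs=nb_epoch, verbose=1)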