CNN -> LSTM network for videos

Question

I have X videos, and each video has a different number of frames, say Y(x). The frame size is the same for all videos: 224x224x3. I pass each frame through a CNN, which outputs a feature vector of length 1024. Now I want to pass these features to an LSTM, which requires batch_size, time_steps, and number_of_features. How should I decide those values? I have two configurations in mind but do not know how to proceed.

Should I break the 1024 features into 32x32 to define time_steps and number_of_features, with batch_size equal to the number of frames?
Should time_steps correspond to the number of frames, with number_of_features equal to 1024, and batch_size (?)

Ishant Mrinal · Accepted Answer · 2017-08-18T06:52:51

It depends on the problem you are trying to solve.

Action classification using videos?

If you are trying to predict the action/event from the video, use num_of_frames as time_steps; batch_size is then the number of videos you want to process together, and number_of_features is the 1024-length CNN output of each frame.
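Since the videos have different frame counts, a batch of them is not rectangular by default. One common workaround, sketched below in plain Python as an illustration only, is to pad every video to the longest one in the batch and keep the true lengths so the padded steps can be masked later. The helper name pad_batch, the zero padding value, and the dummy videos are assumptions, not something from the question; only the 1024 feature size comes from it.

```python
from typing import List, Tuple

FEATURE_DIM = 1024  # CNN output size per frame, as stated in the question

Video = List[List[float]]  # one video: frames x FEATURE_DIM


def pad_batch(videos: List[Video], pad_value: float = 0.0) -> Tuple[List[Video], List[int]]:
    """Pad videos to the longest one so the batch is rectangular.

    Returns the padded batch, shaped (batch_size, time_steps, FEATURE_DIM),
    and the true frame count of each video for masking the padded steps.
    """
    lengths = [len(v) for v in videos]
    time_steps = max(lengths)
    padded = []
    for video in videos:
        missing = time_steps - len(video)
        padding = [[pad_value] * FEATURE_DIM for _ in range(missing)]
        padded.append([list(frame) for frame in video] + padding)
    return padded, lengths


# Three hypothetical videos with 5, 3 and 8 frames of dummy CNN features.
videos = [[[1.0] * FEATURE_DIM for _ in range(n)] for n in (5, 3, 8)]
batch, lengths = pad_batch(videos)
shape = (len(batch), len(batch[0]), len(batch[0][0]))
print(shape)    # (3, 8, 1024)
print(lengths)  # [5, 3, 8]
```

Here batch_size is 3 (videos), time_steps is 8 (the longest video), and number_of_features is 1024. The lengths list is what you would hand to your framework's masking or packed-sequence mechanism so the LSTM ignores the padded frames.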

Per-frame object classification?

In this case you can split the features as 32x32 as time_steps,
