Yes, your understanding of ConvLSTM2D is correct.
The ConvLSTM2D architecture combines the gating mechanism of an LSTM with 2D convolutions.
As you mentioned, a ConvLSTM layer does the same job as an LSTM, but instead of matrix multiplications it uses convolution operations, so the spatial dimensions of the input are retained.
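For reference, this is how the input-gate update changes between the two formulations, following the notation of Shi et al. (2015), where $*$ denotes convolution and $\circ$ the Hadamard product; the other gates follow the same pattern:

```latex
% Plain LSTM (no peepholes): dense matrix products on flattened vectors
i_t = \sigma(W_{xi} x_t + W_{hi} h_{t-1} + b_i)
% ConvLSTM: convolutions over inputs/states that keep their 2D layout
i_t = \sigma(W_{xi} * \mathcal{X}_t + W_{hi} * \mathcal{H}_{t-1} + W_{ci} \circ \mathcal{C}_{t-1} + b_i)
```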
Another approach is to pass each image through a convolution layer, flatten the result into a 1D feature vector, and feed that sequence of feature vectors over time into LSTM layers, as in the sketch below.
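Here is a minimal sketch of that CNN-then-LSTM alternative in Keras; the input shape (10 frames of 64x64x1) and the layer sizes are illustrative assumptions, not values from your question:

```python
from tensorflow.keras import layers, models

model = models.Sequential([
    layers.Input(shape=(10, 64, 64, 1)),              # (time, rows, cols, channels)
    layers.TimeDistributed(layers.Conv2D(16, 3, activation="relu")),
    layers.TimeDistributed(layers.MaxPooling2D()),
    layers.TimeDistributed(layers.Flatten()),          # 1D feature vector per frame
    layers.LSTM(64),                                   # consumes the features over time
    layers.Dense(1),
])
model.summary()
```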
The input of the Keras ConvLSTM2D layer is a 5D tensor with shape
(samples, time, channels, rows, cols) if data_format='channels_first', or
(samples, time, rows, cols, channels) if data_format='channels_last' (the default).
Output of a ConvLSTM2D layer:
If return_sequences=True, it is a 5D tensor with shape
(samples, time, filters, rows, cols) for channels_first, or (samples, time, rows, cols, filters) for channels_last.
If return_sequences=False, it is a 4D tensor with shape
(samples, filters, rows, cols) for channels_first, or (samples, rows, cols, filters) for channels_last.
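A quick way to verify these shapes yourself (a minimal sketch; the sample count, frame size, and 16 filters are arbitrary illustrative values):

```python
import numpy as np
from tensorflow.keras import layers

# channels_last input: (samples, time, rows, cols, channels)
x = np.random.rand(8, 10, 64, 64, 1).astype("float32")

# padding="same" keeps rows/cols unchanged in the output
seq = layers.ConvLSTM2D(filters=16, kernel_size=3, padding="same",
                        return_sequences=True)
last = layers.ConvLSTM2D(filters=16, kernel_size=3, padding="same",
                         return_sequences=False)

print(seq(x).shape)   # (8, 10, 64, 64, 16) -- 5D, time axis kept
print(last(x).shape)  # (8, 64, 64, 16)     -- 4D, last time step only
```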
You can refer to the paper the Keras implementation is based on: Shi et al., 2015, "Convolutional LSTM Network: A Machine Learning Approach for Precipitation Nowcasting" (https://arxiv.org/abs/1506.04214).