I am working on a task where I need to match (approximate) two embeddings, one textual and one visual.
For the visual embedding, I am using VGG as the encoder; its output is a 1x1000 embedding. For the textual encoder, I am using a Transformer whose output has shape 1x712. I want to project both of these vectors to the same dimension, 512.
img_features.shape, txt_features.shape  # (1, 1000), (1, 712)
How can I do this in PyTorch? Should I add a final layer to each architecture that maps its output to 512? Something like the sketch below is what I have in mind.
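For context, this is roughly what I am considering: a minimal sketch with two hypothetical linear projection heads (the names img_proj / txt_proj and the random feature tensors are just stand-ins for my actual encoder outputs):

```python
import torch
import torch.nn as nn

# Hypothetical projection heads, one per encoder,
# each mapping its encoder output to the shared 512-dim space.
img_proj = nn.Linear(1000, 512)  # VGG output: 1x1000 -> 1x512
txt_proj = nn.Linear(712, 512)   # Transformer output: 1x712 -> 1x512

# Stand-ins for the actual encoder outputs
img_features = torch.randn(1, 1000)
txt_features = torch.randn(1, 712)

img_emb = img_proj(img_features)  # shape (1, 512)
txt_emb = txt_proj(txt_features)  # shape (1, 512)
print(img_emb.shape, txt_emb.shape)
```

Is a plain nn.Linear on top of each encoder the right way to do this, or is there a more standard approach?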