I am working on a task that involves approximating two embeddings (textual and visual). For the visual embedding, I am using VGG as the encoder; its output is a 1x1000 embedding. For the textual encoder, I am using a Transformer whose output has shape 1x712. What I want is to convert both vectors to the same dimension, 512.

img_features.shape, txt_features.shape = (1,1000),(1,712)

How can I do this in PyTorch? Should I add a final layer to each architecture that maps its output to 512?

1 Answer

  • You could apply a differentiable PCA operator such as torch.pca_lowrank.

  • Alternatively, an easier solution is to use two fully connected adapter layers to learn two mappings: one for your image features (1000 -> n), the other for your textual features (712 -> n). Then you can choose a fusion strategy to combine the two features shaped (1,n): either concatenation or point-wise addition/multiplication. With point-wise addition or multiplication, n should be equal to 512. With concatenation, you can learn a final mapping n*2 -> 512.
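
The adapter-layer option could look roughly like the sketch below. It is only an illustration under a few assumptions: the class name FusionHead, its argument names, and the choice of n=256 for the concatenation case are made up for this example, and the random tensors stand in for the real VGG and Transformer outputs. If you only need the two embeddings in the same 512-dimensional space (no fusion), the two nn.Linear adapters with n=512 are enough on their own.

```python
import torch
import torch.nn as nn


class FusionHead(nn.Module):
    """Hypothetical head: maps image/text features to size n, then fuses them."""

    def __init__(self, img_dim=1000, txt_dim=712, n=512, out_dim=512, fusion="concat"):
        super().__init__()
        if fusion not in ("concat", "add", "mul"):
            raise ValueError(f"unknown fusion strategy: {fusion}")
        if fusion != "concat" and n != out_dim:
            # Point-wise fusion keeps the size n, so n must already be out_dim.
            raise ValueError("point-wise fusion requires n == out_dim")
        self.fusion = fusion
        # Two learned adapters: 1000 -> n and 712 -> n.
        self.img_adapter = nn.Linear(img_dim, n)
        self.txt_adapter = nn.Linear(txt_dim, n)
        # Concatenation yields 2*n features, so learn a final mapping 2*n -> out_dim.
        self.final = nn.Linear(2 * n, out_dim) if fusion == "concat" else nn.Identity()

    def forward(self, img_features, txt_features):
        img = self.img_adapter(img_features)  # (batch, n)
        txt = self.txt_adapter(txt_features)  # (batch, n)
        if self.fusion == "concat":
            fused = torch.cat([img, txt], dim=-1)  # (batch, 2*n)
        elif self.fusion == "add":
            fused = img + txt
        else:  # "mul"
            fused = img * txt
        return self.final(fused)  # (batch, out_dim)


# Stand-ins for the encoder outputs described in the question.
img_features = torch.randn(1, 1000)
txt_features = torch.randn(1, 712)

head = FusionHead(n=256, fusion="concat")
fused = head(img_features, txt_features)
print(fused.shape)  # torch.Size([1, 512])
```

The adapters are trained together with whatever loss you put on top (and with the encoders too, unless you freeze them), so the mapping to 512 is learned rather than fixed.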