4
votes

Let's suppose that we have a 3D PyTorch tensor, where the first dimension represents the batch_size, as follows:

import torch
import torch.nn as nn
x = torch.randn(32, 100, 25)

That is, for each i, x[i] is a set of 100 25-dimensional vectors. I would like to compute the similarity (e.g., the cosine similarity -- but in general any such pairwise distance/similarity matrix) of these vectors for each batch item.

That is, for each x[i] I need to compute a [100, 100] matrix which will contain the pairwise similarities of the above vectors. More specifically, the (i,j)-th element of this matrix should contain the similarity (or the distance) between the i-th and the j-th row of (the 100x25) x[t], for all t=1, ..., batch_size.

If I use torch.nn.CosineSimilarity(), no matter what dim I'm using, the result is either [100, 25] (dim=0), or [32, 25] (dim=1) , where I need a tensor of size [32, 100, 100]. I would expect torch.nn.CosineSimilarity() to work this way (since, at least to me, it looks more intuitive), but it doesn't.

Could that be done using something like below?

torch.matmul(x, x.permute(0, 2, 1))

I guess that this could give a distance matrix, but what if I need an arbitrary pairwise operation? Should I build this operation using the above?

Or maybe should I repeat x in a way so I can use the built-in torch.nn.CosineSimilarity()?

Thank you.

4

4 Answers

5
votes

The documentation implies that the shapes of the inputs to cosine_similarity must be equal but this is not the case. Internally PyTorch broadcasts via torch.mul, inserting a dimension with a slice (or torch.unsqueeze) will give you the desired result. This is not optimal due to duplicate computations and memory for the upper and lower triangles but it's simple:

import torch
from torch.nn import functional as F
from scipy.spatial import distance

# compute once in pytorch
x = torch.randn(32, 100, 25)
y = F.cosine_similarity(x[..., None, :, :], x[..., :, None, :], dim=-1)

assert y.shape == torch.Size([32, 100, 100])

# test against scipy by iterating over each batch element
z = []
for i in range(x.shape[0]):
    slice = x[i, ...].numpy()
    z.append(torch.tensor(distance.cdist(slice, slice, metric='cosine'), dtype=torch.float32))

# convert similarity to distance and ensure they're reasonably close
assert torch.allclose(torch.stack(z), 1.0-y)

2
votes

If you carefully read the documentation of nn.CosineSimilarity and nn.PairwiseDistance you'll see that they do not compute all pair-wise similarities/distances (as you require), but rather expects two inputs of the same shape, and compute the similarities/distances between all corresponding points only.
That is, if you have two sets of 100 vectors in 32 dimensions, what these function will compute is the similarity/distance between the ith vector in the first set the the corresponding ith vector in the second set, resulting with only 100 similarity/distance values.

If you want to compute all pair-wise distances, you'll need to manually compute them.
Using torch.matmul seems like a step in the right direction.

If you are looking for an efficient way of computing L2 distances, you might find the method in this answer useful.

1
votes

A partial answer to the question, for the case of Euclidean distance/similarity (or more generally any p-norm distance), but NOT for Cosine similarity:

Use the torch.cdist, which "Computes batched the p-norm distance between each pair of the two collections of row vectors".

0
votes

This should do it:

import torch.nn.functional as F
x = torch.randn(32, 100, 25)

# cosine similarity: normalize and multiply
cos = lambda m: F.normalize(m) @ F.normalize(m).t()
torch.stack([cos(m) for m in x])  # [32, 100, 100]

Note: this is the cosine similarity implementation in sentence-transformers