I am trying to find the fastest way to perform the following pairwise distance calculation in Python. I want to use the distances to rank a list_of_objects by their similarity.
Each item in the list_of_objects is characterised by four measurements a, b, c, d, which are made on very different scales, e.g.:
object_1 = [0.2, 4.5, 198, 0.003]
object_2 = [0.3, 2.0, 999, 0.001]
object_3 = [0.1, 9.2, 321, 0.023]
list_of_objects = [object_1, object_2, object_3]
The aim is to get a pairwise distance matrix of the objects in list_of_objects. However, I want to be able to specify the 'relative importance' of each measurement in my distance calculation via a weights vector with one weight per measurement, e.g.:
weights = [1, 1, 1, 1]
would indicate that all measurements are equally weighted. In this case I want each measurement to contribute equally to the distance between objects, regardless of the measurement scale. Alternatively:
weights = [1, 1, 1, 10]
would indicate that I want measurement d to contribute 10x more than the other measurements to the distance between objects.
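To illustrate the 'regardless of the measurement scale' point with the toy numbers above (the printed values are just for this example data):

import numpy as np

X = np.asarray(list_of_objects, dtype=float)

# Raw differences between object_1 and object_2: measurement c (hundreds)
# dwarfs a, b and d, so an unnormalised sum would be dominated by c alone.
print(np.abs(X[0] - X[1]))                      # [  0.1    2.5  801.     0.002]

# Dividing by each measurement's range puts the four contributions on a
# comparable footing before any weights are applied; this is the kind of
# rescaling my algorithm below does via its normalisation step.
print(np.abs(X[0] - X[1]) / np.ptp(X, axis=0))  # [0.5    0.347  1.     0.091]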
My current algorithm looks like this:
1. Calculate a pairwise distance matrix for each measurement
2. Normalise each distance matrix so that the maximum is 1
3. Multiply each distance matrix by the appropriate weight from weights
4. Sum the distance matrices to generate a single pairwise matrix
5. Use the matrix from step 4 to provide a ranked list of pairs of objects from list_of_objects
This works fine, and gives me a weighted version of the city-block distance between objects.
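For reference, my current implementation is roughly equivalent to this minimal NumPy sketch (the helper name weighted_cityblock_matrix is just for illustration; the per-measurement distance is the absolute difference, which is what gives the city-block behaviour):

import numpy as np

def weighted_cityblock_matrix(objects, weights):
    """Illustrative helper: steps 1-4 above, using NumPy broadcasting."""
    X = np.asarray(objects, dtype=float)   # shape (n_objects, n_measurements)
    w = np.asarray(weights, dtype=float)
    n = X.shape[0]

    total = np.zeros((n, n))
    for j in range(X.shape[1]):
        # Step 1: pairwise distance matrix for measurement j
        d = np.abs(X[:, j, None] - X[None, :, j])
        # Step 2: normalise so that the maximum distance is 1
        d /= d.max()
        # Steps 3 and 4: weight this matrix and accumulate the sum
        total += w[j] * d
    return total

dist = weighted_cityblock_matrix(list_of_objects, weights=[1, 1, 1, 10])

# Step 5: rank pairs of objects, most similar (smallest distance) first
i, j = np.triu_indices(len(list_of_objects), k=1)
order = np.argsort(dist[i, j])
ranked_pairs = list(zip(i[order], j[order]))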
I have two questions:
1. Without changing the algorithm, what is the fastest implementation in SciPy, NumPy, or scikit-learn to perform the initial distance matrix calculations?
2. Is there an existing multi-dimensional distance approach that does all of this for me?
For question 2, I have looked but couldn't find anything with a built-in step that handles the 'relative importance' in the way that I want.
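For what it's worth, the closest I've got to a single call is to fold the normalisation and the weights into the data first: normalising each per-measurement distance matrix by its maximum is the same as dividing that measurement's column by its range, so one SciPy city-block pdist on the rescaled data should reproduce my summed matrix. An untested sketch of that idea:

import numpy as np
from scipy.spatial.distance import pdist, squareform

X = np.asarray(list_of_objects, dtype=float)
w = np.asarray([1, 1, 1, 10], dtype=float)

# The maximum pairwise distance for a single measurement is that
# measurement's range (max - min), so dividing each column by its range and
# multiplying by its weight folds steps 2 and 3 into the data itself.
X_scaled = X * (w / np.ptp(X, axis=0))

# A single call then gives the summed, weighted matrix from step 4.
dist = squareform(pdist(X_scaled, metric='cityblock'))

But I would still prefer a built-in or more standard approach if one exists.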
Other suggestions welcome. Happy to clarify if I've missed details.