Implementing a custom distance function

Question

A and B are matrices of binary elements. A is the base data matrix and B is the query matrix. A consists of 75 data points, each of length 10, and B consists of 50 data points, each of length 10. I want to calculate the distance between every data point in A and every query data point in B in order to apply nearest neighbor search. Instead of using the Euclidean or the Hamming distance, I have used another metric:

N = 2, k = length of data samples, s = A(1,:) and t = B(1,:). The code works for one data sample in A and one data sample in B. How do I scale it so that it works for all base data points and all query data points?

Example for which the code works

Let A(1,:) = [1,0,1,1,0,0,0,1,1,0] be the first sample in matrix A, and let B(1,:) = [1,1,0,0,1,1,1,1,0,0] be the first query point.

If corresponding elements in the samples taken from A and B are the same, 0 is recorded for that position, otherwise 1. The final distance is the sum of the 1's. The program then checks whether the two sequences are identical, setting b to 1 if so and to 0 otherwise. Can somebody please show how I can apply this to matrices?

Code

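l = size(A,2);   % elements per sample (length(A) would return 75 for a 75x10 matrix)

% D(1,i) is 1 where the two samples differ at position i, 0 where they agree
D = zeros(1,l);
for i = 1:l
    if A(1,i) == B(1,i)
        D(1,i) = 0;
    else
        D(1,i) = 1;
    end
end

% total number of differing positions (named total to avoid shadowing the built-in sum)
total = 0;
for j = 1:l
    total = total + D(1,j);
end

% b is 1 if the two samples are identical, 0 otherwise
if total == 0
    b = 1;
else
    b = 0;
end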

You mention that your distance metric is not the Hamming distance, but the first for loop looks like it is similar... can you clarify? The Hamming distance adds up the total number of disagreeing positions between corresponding elements. However, you are calculating the total number of agreeing positions, which is what your first for loop is doing. Also, are you saying that this code works between two query vectors and you want to extend it to matrices? I would like to write a more vectorized approach, but if you are bent on using loops I can live with that. Please clarify. — rayryeng
In the explanation you are saying " If the elements in samples taken from A and B are same, 1 is recorded for each similar element, otherwise zero" but in the code you are doing the opposite — Novice_Developer
Please see the edited question where I have put the formula. The code works between one base vector and one query vector. I am asking how I can modify it to work for 75 base vectors and 50 query vectors. Thank you — SKM

ibezito · Accepted Answer · 2016-06-09T18:24:05

One line solution

This calculation can be done in a single line of code:

D = A*B'+(1-A)*(1-B)' < size(A,2)

Explanation

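Because A and B are binary, the distance function between each sample in A and each sample in B essentially checks whether the number of per-coordinate matches equals the sample length. A*B' counts the positions where both samples are 1, and (1-A)*(1-B)' counts the positions where both are 0, so their sum is the number of matching positions. This is why matrix multiplication is enough here.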
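As a quick check of this idea, here is a minimal sketch that uses the two sample vectors from the question. The variable names s, t, numMatches and isDifferent are chosen for illustration only:

```matlab
% Sample vectors taken from the question
s = [1,0,1,1,0,0,0,1,1,0];
t = [1,1,0,0,1,1,1,1,0,0];

% s*t' counts positions where both vectors are 1 (positions 1 and 8),
% (1-s)*(1-t)' counts positions where both are 0 (position 10)
numMatches = s*t' + (1-s)*(1-t)';      % 3 matching positions

% The vectors differ if fewer than all positions match
isDifferent = numMatches < numel(s);   % true, since 3 < 10
```

Only 3 of the 10 positions agree, so the two samples are reported as different.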

More descriptive code example

Define A and B as two binary matrices, as you mentioned in your question:

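% initialize A and B randomly as binary matrices
A = double(rand(75,10) > 0.5);   % 75 base samples of length 10
B = double(rand(50,10) > 0.5);   % 50 query samples of length 10
[m,n] = size(A);                 % m = 75 samples, n = 10 elements per sample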

The distance between each sample in A and each sample in B can be calculated as follows:

First, define a matrix D of size 75x50 such that D(i,j) contains the number of matches between sample i in A and sample j in B.

It can be calculated as follows:

D = A*B' + (1-A)*(1-B)';
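Since every position either matches or does not, subtracting the match count from the sample length gives the number of mismatches, which is the Hamming distance. A small sketch of this, assuming made-up toy matrices (the names H and nearest are illustrative and not part of the original code):

```matlab
% Toy data made up for illustration: 2 base samples, 3 queries, length 4
A = [1 0 1 1; 0 0 0 0];
B = [1 0 1 1; 1 1 1 1; 0 0 1 0];
n = size(A,2);

D = A*B' + (1-A)*(1-B)';   % D(i,j) = number of matching positions
H = n - D;                 % H(i,j) = number of mismatching positions

% For each query (column), index of the base sample with the fewest mismatches
[~, nearest] = min(H, [], 1);
```

Here D is [4 3 2; 1 0 3], so H is [0 1 2; 3 4 1]. If the goal is nearest neighbor search, a graded count like H may be more useful than a yes/no answer, but that depends on the metric you actually need.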

The final distance measure is obtained by testing, for each pair (i,j), whether the number of matches is smaller than n (the length of each sample). If it is smaller, the samples are different and the result should be 1. Otherwise it should be 0. This can be done as follows:

finalDist = D < n ;
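To convince yourself that the vectorized result agrees with a row-by-row comparison, you could run a check along these lines. This is only a sketch: the loop-based reference ref is an illustrative helper, and one identical pair is forced so that both outcomes appear:

```matlab
% Random binary data with the sizes from the question
A = double(rand(75,10) > 0.5);
B = double(rand(50,10) > 0.5);
B(1,:) = A(1,:);            % force one identical pair so both outcomes occur
n = size(A,2);

% Vectorized result: 1 where the samples differ, 0 where they are identical
finalDist = (A*B' + (1-A)*(1-B)') < n;

% Loop-based reference: compare every base row with every query row
ref = false(size(A,1), size(B,1));
for i = 1:size(A,1)
    for j = 1:size(B,1)
        ref(i,j) = ~isequal(A(i,:), B(j,:));
    end
end

assert(isequal(finalDist, ref));   % both approaches should agree
```

Note that finalDist is 1 for different samples, which is the opposite convention of b in the question's code (b is 1 for identical samples). Use ~finalDist if you need the question's convention.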
