1
votes

A and B are matrices consisting of binary elements. A is denoted as the base data matrix and B is the query matrix. A consists of 75 data points each of length 10 and B consists of 50 data points each of length 10. I want to calculate the distance between all data points in A and every query data point in B in order to apply nearest neighbor search. So instead of using the Euclidean or the hamming distance, I have used another metric :

metric

N = 2, k = length of data samples, s = A(1,:) and t = B(1,:). The code works for one data sample in A and another data sample in B. How do I scale so that it works for all base data points and all query data points?

Example for which the code works

Let A(1,:) = [1,0,1,1,0,0,0,1,1,0] is the first sample in A matrix. Let B(1,:) = [1,1,0,0,1,1,1,1,0,0] is the first query point.

If the elements in samples taken from A and B are same, 0 is recorded for each similar element, otherwise 1. The final distance is the sum of the 1's. So the program checks to see if two sequences are the same, setting b to 1 if so, or a zero otherwise. Can somebody please show how I can apply this to matrices?

Code

l = length(A);

D=zeros(1,l);
for i=1:l,
    if A(1,i)==B(1,i),
        D(1,i)=0;
    else 
        D(1,i)=1;
    end
end

sum=0;
for j=1:l,
    sum=sum+D(1,j);
end

if sum==0, 
    b = 1;
else 
    b = 0;
end
4
You mention that your distance metric is not the Hamming distance but the first for loop looks like it is similar... can you clarify? The Hamming distance adds up the total number of disagreeing positions between corresponding elements. However, you are calculating the total number of agreeing positions, which is what your first for loop is doing. Also, are you saying that this code works between two query vectors and you want to extend to matrices? I would like to write a more vectorized approach, but if you are bent in using loops I can live with that. Please clarify.rayryeng
In the explanation you are saying " If the elements in samples taken from A and B are same, 1 is recorded for each similar element, otherwise zero" but in the code you are doing the oppositeNovice_Developer
Please see the edited question where I have put the formula. The code works between one base vector and one query vector. I am asking how I can modify it to work for 75 base vectors and 50 query vectors. Thank youSKM

4 Answers

5
votes

One line solution

This calculation can be done in a single line of code:

D = A*B'+(1-A)*(1-B)' < size(A,2)

Explanation

Do to the fact that A and B are binary, the distance function between each sample at A and each sample at B basically checks if the amount of per-coordinates matches is equal to a sample's length. You can use matrix multiplication to achieve this.

More descriptive code example

Define A and B as two binary matrices as you mentioned in your answer:

%initializes A and B randomly
A = double(rand(75,10) > 0.5);
B = double(rand(50,10) > 0.5);
[m,n] = size(A);

The distance between each sample in A and each sample in B can be calculated as follows:

First, define a matrix D of size 75x50, s.t D(i,j) is contains the number of matches between the sample i in A and the sample j in B.

It can be calculated as follows:

D = A*B' + (1-A)*(1-B)';

The final distance measure can be done by testing for each pair (i,j) if their amount of matches is smaller than n (n is the length of each sample). If it is smaller the samples are different and the result should be 1. Otherwise it should be zero. this can be done as follows:

finalDist = D < n ;
2
votes

If you want to get your code working as is with loops, simply allocate space that is size(A,1) x size(B,1) large so that each spatial location (i,j) gives you "distance" between rows i and j.

Therefore, do something like this. This is assuming that A is a M x d matrix and B is a N x d matrix where d are the total number of feature points and M and N are arbitrary positive numbers that denote how many rows of elements there are in each.

b = zeros(size(A,1), size(B,1)); % Change
l = size(A,2); % Change - Total number of feature points

for ii = 1 : size(A,1) % Change
    for jj = 1 : size(B,1)
        D=zeros(1,l);
        for i=1:l,
            if A(ii,i)==B(jj,i) % Change
                D(i)=0;
            else 
                D(i)=1;
            end
        end

        sum=0;
        for j=1:l,
            sum=sum+D(j);
        end

        if sum==0, 
            b(ii,jj) = 1; % Change
        end
    end
end

This will iterate over all combinations of rows. However, use any of the previous answers here to get it vectorized. I just wanted to show you how you would modify your current code if this is what you're most comfortable with.

2
votes

Your distance metric is actually just the L1-norm i.e. sum(abs(x-y)) so in Octave you can use pdist2 like this:

pdist2(A,B,'L1')

In MATLAB you can use city block distance:

pdist2(A,B,'cityblock')

Note, to define your own distance metric (but 'cityblock' is a better idea):

pdist2(A,B,@(x,y)sum(abs(bsxfun(@minus,x,y)),2))

Or

pdist2(A,B,@(x,y)sum(bsxfun(@xor,x,y),2))

Or

pdist2(A,B,@(x,y)sum(bsxfun(@ne,x,y),2))

The distance of one of your vectors with another can be found like this:

distance = @(x,y)sum(x~=y)

However you want to compare all the row of A with all the rows of B. bsxfun is going to be useful here, we just need to use permute to make one of the matrices go into the third dimension:

D = squeeze(sum(bsxfun(@ne, permute(B, [3,2,1]),A),2))'

For example if

A = [1,1,0;
     0,0,1;
     1,1,1];

B = [1,1,1;
     0,0,0;
     1,1,1;
     0,1,0]

Then

> D = squeeze(sum(bsxfun(@ne, permute(B, [3,2,1]),A),2))'

D =

   1   2   0
   2   1   3
   1   2   0
   1   2   2

So the columns are now the rows of A and he rows are the rows of B so D(2,3) means compare B(2,:) (which is [0,0,0] with A(3,:) which is [1,1,1], and so since all the elements are different, the distance between them is 3.

If you have the stats toolbox then you can use my distance function above with pdist2 instead.

0
votes

thats what i understood from your description(correct me if I am wrong) Let A be a matrix

A =

     0     1     0     0     1     0     0     1     0     1
     1     1     0     0     1     1     1     0     0     1
     0     1     1     0     0     1     1     0     1     1
     0     1     1     1     1     1     1     0     0     0
     1     1     0     0     0     1     0     0     1     1
     0     0     0     0     0     1     0     0     1     0
     1     0     1     0     1     0     0     1     1     0
     1     1     0     0     1     0     1     0     0     0
     1     0     1     0     0     1     0     0     1     0
     0     0     1     0     1     0     1     1     0     0

and B be

B =

     0     1     1     1     0     1     0     0     1     0
     0     1     0     1     0     0     1     1     1     1
     1     0     0     1     1     0     0     0     1     1
     0     1     0     1     1     0     0     0     1     0
     1     1     0     0     0     0     0     0     0     1
     1     1     0     1     1     1     1     1     1     0
     1     1     0     0     1     0     0     1     0     1
     0     0     1     1     0     1     1     1     0     1
     0     0     0     1     1     0     0     0     1     0
     0     1     1     0     0     0     1     1     0     1
>> C = A.*B

will give you common points between them if A has more number of row lets say then you can do is A(1:size(B,1),:).*B instead

C =

     0     1     0     0     0     0     0     0     0     0
     0     1     0     0     0     0     1     0     0     1
     0     0     0     0     0     0     0     0     1     1
     0     1     0     1     1     0     0     0     0     0
     1     1     0     0     0     0     0     0     0     1
     0     0     0     0     0     1     0     0     1     0
     1     0     0     0     1     0     0     1     0     0
     0     0     0     0     0     0     1     0     0     0
     0     0     0     0     0     0     0     0     1     0
     0     0     1     0     0     0     1     1     0     0

In the find if there are no matching points b = 1 otherwise 0

b = ~find(sum(sum(C)))

Update : if D should be 75x50 like you say so then C should be

  C = A*(B.')

instead of

C = A.*B

because first i thought its a dot comparison from your code