2
votes

Assuming that I have a dataset of the following size:

train = 500,000 * 960  %number of training samples (vector) each of 960 length

B_base = 1000000*960 %number of base samples  (vector) each of 960 length


Query = 1000*960  %number of query samples  (vector) each of 960 length

truth_nn = 1000*100

truth_nn contains the ground truth neighbors in the form of the pre-computed k nearest neighbors and their square Euclidean distance. So, the columns of truth_nn represent the k = 100 nearest neighbors. I am finding difficult to apply nearest neighbor search in the code snippet. Can somebody please show how to apply the ground truth neighbors truth_nn in finding the mean average precision-recall?

It will be of immense help if somebody can show with any small example by creating any data matrix, query matrix, and the ground truth neighbors in the form of the pre-computed k nearest neighbors and their square Euclidean distance. I tried creating a sample database.

Assume, the base data is

B_base = [1 1; 2 2; 3 2; 4 4; 5 6];

Query data is

 Query = [1 1; 2 1; 6 2];

[neighbors distances] = knnsearch(a,b,'k',2);

would find 2 nearest neighbors.

Question 1: how do I create the truth data containing the ground truth neighbors and pre-computed k nearest neighbor distances? This is called as the mean average precision recall. I tried implementing the knearest neighbor search and the average precision recall as follows but cannot understand (unsure) how to apply the ground truth table

Question 2:

I am trying to apply k nearest neighbor search by converting first the real-valued features into binary.

I am unable to apply the concept of k-nearest neighbor search for different values of k = 10,20,50 and to check how much data has been correctly recalled using the GIST database. In the GIST truth_nn() file, when I specify truth_nn(i,1:k) for a query vector i, the function AveragePrecision throws error. So, if somebody can show using any sample ground truth that is of similar structure to that in GIST, how to properly specify k and calculate the Average precision recall, then I shall be able to apply the solution to the GIST database. As of now, this is my approach and shall be of immense help if the correct way is provided using any example that will be easier for me to relate to the GIST database. The problem is on how I can find neighbors from the ground truth and compare it with the neighbors obtained after sorting the distances?

I am also interested on how I can apply pdist2() instead of the present distance calcualtion as it takes a long time.

 numQueryVectors = size(Query,1);
       %Calculate distances
     for i=1:numQueryVectors,
      queryMatrix(i,:)
      dist = sum((repmat(queryMatrix(i,:),numDataVectors,1)-B_base ).^2,2);
     [sortval sortpos] = sort(dist,'ascend');
      neighborIds(i,:) = sortpos(1:k);
     neighborDistances(i,:) = sqrt(sortval(1:k));
    end


        %Sorting calculated nearest neighbor distances for k = 50



 %HOW DO I SPECIFY k = 50 in the ground truth, truth_nn
for i=1:numQueryVectors
  AP(i) = AveragePrecision(neighborIds(i,:),truth_nn(i,:));
end
mAP = mean(AP);


  function ap = AveragePrecision(rank_id, truth_id)
    truth_num = length(truth_id);


truth_pos = zeros(truth_num,1);

for j=1:50  %% for k = 50 nearest neighbors
    truth_pos(j) = find(rank_id == truth_id(j));
end
truth_pos = sort(truth_pos, 'ascend');

% compute average precision as the area below the recall-precision curve
ap = 0;
delta_recall = 1/truth_num;
for j=1:truth_num
    p = j/truth_pos(j);
    ap = ap + p*delta_recall;
end

    end
end

UPDATE : Based on solution, I tried to calculate the mean average precision using the formula given formula hereand a reference code . But, not sure if my approach is correct because the theory says that I need to rank the returned queries based on the indices. I do not understand this fully. Mean average precision is required to judge the quality of the retrieval algortihm.

precision = positives/total_data;
recal = positives /(positives+negatives);
precision = positives/total_data;
recall = positives /(positives+negatives);
truth_pos = sort(positives, 'ascend');
truth_num = length(truth_pos);

ap = 0;
delta_recall = 1/truth_num;
for j=1:truth_num
    p = j/truth_pos(j);
    ap = ap + p*delta_recall;
end
ap

The value of ap = infinity , value of positive = 0 and negatives = 150. This means that knnsearch() does not work at all.

1
How is this different from your other question(s) on this topic?beaker
@beaker: In my other Question, I had asked how to create multiple hash tables for alogirhtm - Locality Sensitive Hashing. Then I asked how I can work with GIST database. In particular, I am struggling in how to apply the ground truth table that consists of the actual labels and distance. Since, this was a very specific question I thought of asking a general one where I created a simple query and base data. Now, I do not know how I can create the ground truth table. My objective is to apply the nearest neighbor search and assess the quality using the average precision recall metric.SKM
In order to apply average precision recall, I believe we need the ground truth table. The GIST database does have one, but I do not understand how to use it. Therefore, I ask here to help in showing how to apply the ground truth in nearest neighbor and calculating the average precision recall, with the help of any sample ground table that has the same structure as that of the GIST database's ground truth table.SKM
@halfer: I have answered to the comment, please let me know if it is clear or not. Thank youSKM
I've no idea, not my area - but it is good to reply!halfer

1 Answers

1
votes

I think you are doing extra work. This process is very simple in matlab, you can also operate on entire arrays. This should be faster than for loops, and is a bit easier to read.

Your truth_nn and neighbors should have the same data, if there are no errors. There is one entry per row. Matlab already sorts the kmeans result in ascending order, so the column 1 is the closest neighbor, the second closest is column 2, 3rd closest is 3,.... There is no need to sort the data again.

Just compare truth_nn to neighbors to get your statistics. This is a simple example to show you how the program should go. It will not work on your data without some modification

%in your example this is provided, I created my own
truth_nn = [1,2;
            1,3;
            4,3];

B_base = [1 1; 2 2; 3 2; 4 4; 5 6];
Query = [1 1; 2 1; 6 2];

%performs k means
num_clusters = 2;
[neighbors distances] = knnsearch(B_base,Query,'k',num_clusters);

%--- output---
% neighbors = [1,2; 
%              1,2; notice this doesn't match truth_nn 1,3
%              4,3]
% distances = [     0    1.4142;
%              1.0000    1.0000;
%              2.8284    3.0000];

%computes statistics, nnz counts number of nonzero elements, in the first
%case every piece of data that matches 
%NOTE1: the indexing on truth_nn (:,1:num_clusters ) it says all rows
%       but only use the first num_clusters columns. This should
%       prevent the dimension mistmatch error you were getting
positives = nnz(neighbors == truth_nn(:,1:num_clusters ));     %result = 5
negatives = nnz(neighbors ~= truth_nn(:,1:num_clusters ));     %result = 1
%NOTE1: I've switched this from truth_nn to neighbors, this helps
%       when you cahnge num_neghbors 
total_data = numel(neighbors);               %result = 6

percent_incorrect = 100*(negatives / total_data);   % 16.6666
percent_correct   = 100*(positives / total_data);   % 93.3333