What is the correct version of Average precision?

Question

I'm trying to compute the Average Precision (and Mean Average Precision) on the Oxford Building image dataset.

Below is the code they provide for computing Average Precision. Note that pos_set is the union of the "OK" and "good" images from the ground truth set, while junk_set is a set of non-relevant images.

float OxfordTest::computeAp(std::vector<std::string> &ranked_list){
      float old_recall = 0.0;
      float old_precision = 1.0;
      float ap = 0.0;

      size_t intersect_size = 0;  // positives seen so far
      size_t i = 0;
      size_t j = 0;               // non-junk items seen before the current one
      for ( ; i<ranked_list.size(); ++i) {
        if (junk_set.count(ranked_list[i])) continue;  // junk: skipped as if not in the list
        if (pos_set.count(ranked_list[i])) intersect_size++;

        float recall = intersect_size / (float)pos_set.size();
        float precision = intersect_size / (j + 1.0);

        // recall step times the mean of the previous and current precision
        ap += (recall - old_recall)*((old_precision + precision)/2.0);

        old_recall = recall;
        old_precision = precision;
        j++;
      }
      return ap;
}

This is quite different from the definition given on the linked Wikipedia page. How are the two definitions related?

I'm more than sure that Wikipedia's notion is correct, since it corresponds with the one given in this answer and this article.

I don't understand why the code above uses:

The recall, while Wikipedia's definition includes only precision in the last formula.
The term (old_precision + precision) / 2: even considering the formula with the delta recall, nobody mentions it.

This is the original C++ code.

Relja Arandjelović · Accepted Answer · 2016-12-22T13:14:53

Junk set

The original paper states:

(3) Junk – less than 25% of the object
is visible, or there is a very high level of occlusion or distortion.
(4) Absent – the object is not present

I.e. junk images are not negatives. There are positives (OK + Good), ignores (Junk) and negatives (Absent). Note that all of these are per-query, i.e. some images are junk for query 1 but not for query 15. If you look at the 'junk' images you'll see ambiguous examples: some have extreme zoom or blurring that makes you wonder whether the image contains the queried landmark at all, and in others only a tiny part of the object is visible, so the image is too hard.

In computing the average precision, we use the Good and
Ok images as positive examples of the landmark in question,
Absent images as negative examples and Junk images
as null examples. These null examples are treated as though
they are not present in the database – our score is unaffected
whether they are returned or not.

So the authors defined the junk set to be neither positives nor negatives. These images most likely depict the queried object, but for some of them we are not sure, or it would be too harsh to treat them as positives and ask the system to retrieve them (and therefore penalise it if it doesn't). At the same time, it would also be harsh to treat them as negatives: if the system does retrieve them, it shouldn't be penalised.

So all that needs to be done is that (on a per-query basis) you ignore the junk images and treat them as if they don't exist. You take the retrieved list, filter out all junk images for this query, then run the normal AP computation on the filtered list. That is effectively what the code is doing: when the example is in amb (= junk, i.e. junk_set in your code), it is just skipped. If the example is not in amb and it is in pos(itives), intersect_size (the number of positives up to position i) is incremented. The quantity j + 1 is the number of non-skipped elements up to and including the current one (j is incremented only if the current element is not junk).

AP computation

You certainly need the recall in your AP computation, as explained by shiri in the previous answer, and as described in your article, where p(r) is the precision at a particular recall. The best way to think of AP is not to examine a random formula but to understand the intuition and then see how the formula captures it, i.e. what Wikipedia says at the start: you can plot precision as a function of recall, and AP is then simply the area under the curve. You want the precision to be high at all recalls, so the ideal curve is p(r)=1, which would maximise the AP.

So what is the code doing? It's computing the area under the precision-recall curve using the trapezoidal rule; see this equation on Wikipedia and you'll see it's identical to the code. The AP computation for the discrete case from your Wikipedia article is a commonly used but cruder approximation to the area under the precision-recall curve, namely the rectangle method.
