In the context of Information Retrieval, some papers, like this one, talk about Aggregate Precision-Recall curves (cf. Figure 3). What is the difference between these curves and Precision-Recall curves? The authors of this paper seem to draw a distinction between the two, because they describe the curves shown in Figure 4 as Precision-Recall curves rather than Aggregate Precision-Recall curves (cf. Section 4.5).
2 Answers
Aggregate vs. Non-aggregate P&R Curves
In general, there is a difference between precision-recall curves and aggregate precision-recall curves. You typically create a precision-recall curve for a single query (query = entity in this paper) given a system: by slicing up the ranking and calculating both precision and recall at every cutoff, you can plot this curve.
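To make that concrete, here is a minimal sketch of building such a curve from one ranked result list. The data and function name are hypothetical, not from the paper:

```python
def pr_curve(relevant, ranking, num_relevant):
    """Precision and recall at every cutoff of one ranked list.

    relevant:      set of document ids judged relevant for this query
    ranking:       document ids in system order, best first
    num_relevant:  total number of relevant documents for the query
    """
    points = []
    hits = 0
    for k, doc_id in enumerate(ranking, start=1):
        if doc_id in relevant:
            hits += 1
        points.append((hits / num_relevant, hits / k))  # (recall, precision)
    return points

# Example: 2 of the 3 relevant documents appear in the top 4 results.
print(pr_curve({"d1", "d3", "d9"}, ["d1", "d7", "d3", "d2"], 3))
# [(0.33, 1.0), (0.33, 0.5), (0.67, 0.67), (0.67, 0.5)]
```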
When you have a few hundred queries (entities), as is typical in papers, you can't show a few hundred graphs (nor could humans interpret them...), so what you do is average the curves somehow. This is what they refer to as "aggregate" precision-recall curves in this work. It is a little unfortunate that they do not specify their aggregation method, but it would be reasonable to assume they use the mean, which is quite typical for these curves. In situations like this, I like to mention the software package I used, since it is difficult to know exactly how to align recall levels across queries.
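Since the paper does not specify its aggregation method, here is one common approach you might assume (not necessarily theirs): interpolate each query's precision onto a shared recall grid, here the classic 11 levels, and take the mean at each level. The function names and grid are illustrative:

```python
def interpolated_precision(points, r):
    """Interpolated precision at recall r: the maximum precision
    achieved at any recall level >= r for one query's curve.

    points: list of (recall, precision) pairs for a single query
    """
    return max((p for rec, p in points if rec >= r), default=0.0)

def aggregate_pr(curves, grid=None):
    """Mean interpolated precision over all queries at each recall level."""
    if grid is None:
        grid = [i / 10 for i in range(11)]  # classic 11-point recall grid
    return [
        (r, sum(interpolated_precision(c, r) for c in curves) / len(curves))
        for r in grid
    ]

# curves = [pr_curve(...) for each query]; plotting aggregate_pr(curves)
# gives a single averaged precision-recall curve.
```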
On your more specific question (about Figures 3 & 4):
They're not actually drawing a distinction between Figure 3 and Figure 4 in this paper; they're just less precise in their references to Figure 4. At the very end of Section 4.1 (Dataset and Evaluation Metrics), they mention that they
"report both the aggregate curves precision/recall curves and Precision@N (P@N) in our experiments"
This is a typical convention in papers: unless specifically stated otherwise, you can assume that graphs and measures refer to those described in a setup section like this one.
There are multiple relations considered. For each one of them, we order the instances discovered from the test set by the confidence score (which is encoded in the output of the network), and compute the precision and recall values. Once this is done for all the relation types, the precision-recall curves are averaged, so that in the end we have a single list of precision-recall values parameterized by the number of retrievals. How exactly the average is computed is not clearly stated in the paper. The plot of this list is what is referred to as the aggregate precision-recall curve. Thanks to @John Foley!
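Here is a sketch of that per-relation procedure. All names are hypothetical, and the final averaging step is the assumed mean from the earlier sketch, since the paper leaves it unspecified:

```python
def relation_pr_curve(scored_instances, gold):
    """One precision-recall curve for one relation type.

    scored_instances: list of (instance, confidence) pairs for this relation
    gold:             set of true instances for this relation
    """
    # Rank extracted instances by the model's confidence, best first.
    ranking = sorted(scored_instances, key=lambda x: x[1], reverse=True)
    points, hits = [], 0
    for k, (instance, _score) in enumerate(ranking, start=1):
        if instance in gold:
            hits += 1
        points.append((hits / len(gold), hits / k))  # (recall, precision)
    return points

# curves = [relation_pr_curve(scored, gold) for each relation type]
# aggregate = aggregate_pr(curves)  # average into one curve, as sketched above
```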