The way precision and recall are typically computed (this is what I use in my papers) is to measure entities against each other. Suppose the ground truth has the following (without any differentiation as to what type of entities they are):
[Microsoft Corp.] CEO [Steve Ballmer] announced the release of [Windows 7] today
This has 3 entities.
Suppose your actual extraction has the following:
[Microsoft Corp.] [CEO] [Steve] Ballmer announced the release of Windows 7 [today]
You have an exact match for Microsoft Corp., false positives for CEO and today, a false negative for Windows 7, and a substring match for Steve.
We compute precision and recall by first defining matching criteria. For example, do they have to be an exact match? Is it a match if they overlap at all? Do entity types matter? Typically we want to provide precision and recall for several of these criteria.
Exact match: True Positives = 1 (Microsoft Corp., the only exact match), False Positives = 3 (CEO, today, and Steve, which isn't an exact match), False Negatives = 2 (Steve Ballmer and Windows 7)
Precision = True Positives / (True Positives + False Positives) = 1/(1+3) = 0.25
Recall = True Positives / (True Positives + False Negatives) = 1/(1+2) = 0.33
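To make the bookkeeping concrete, here is a minimal Python sketch of the exact-match tally; the span() helper and the character-offset representation are just my assumptions for illustration, not part of any standard scorer:

    text = "Microsoft Corp. CEO Steve Ballmer announced the release of Windows 7 today"

    def span(s):
        # Hypothetical helper: (start, end) character offsets of the first
        # occurrence of s in the example sentence.
        i = text.find(s)
        return (i, i + len(s))

    gold = {span("Microsoft Corp."), span("Steve Ballmer"), span("Windows 7")}
    pred = {span("Microsoft Corp."), span("CEO"), span("Steve"), span("today")}

    tp = len(gold & pred)       # exact span matches: Microsoft Corp. -> 1
    fp = len(pred - gold)       # CEO, Steve, today -> 3
    fn = len(gold - pred)       # Steve Ballmer, Windows 7 -> 2

    precision = tp / (tp + fp)  # 1 / (1 + 3) = 0.25
    recall = tp / (tp + fn)     # 1 / (1 + 2) = 0.33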
Any Overlap OK: True Positives = 2 (Microsoft Corp., and Steve, which overlaps Steve Ballmer), False Positives = 2 (CEO and today), False Negatives = 1 (Windows 7)
Precision = True Positives / (True Positives + False Positives) = 2/(2+2) = 0.5
Recall = True Positives / (True Positives + False Negatives) = 2/(2+1) = 0.66
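With the same hypothetical spans (gold and pred from the sketch above), only the matching test changes for the any-overlap criterion; precision counts predicted entities that touch some ground-truth entity, and recall is taken against ground-truth entities that nothing touched:

    # gold and pred are the span sets defined in the exact-match sketch above.
    def overlaps(a, b):
        # Two (start, end) spans overlap if each starts before the other ends.
        return a[0] < b[1] and b[0] < a[1]

    tp = sum(any(overlaps(p, g) for g in gold) for p in pred)      # Microsoft Corp., Steve -> 2
    fp = len(pred) - tp                                            # CEO, today -> 2
    fn = sum(not any(overlaps(g, p) for p in pred) for g in gold)  # Windows 7 -> 1

    precision = tp / (tp + fp)  # 2 / (2 + 2) = 0.5
    recall = tp / (tp + fn)     # 2 / (2 + 1) = 0.66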
The reader is then left to infer that the "real performance" (the precision and recall that an unbiased human checker would give when allowed to use human judgement to decide which overlap discrepancies are significant, and which are not) is somewhere between the two.
It's also often useful to report the F1 measure, which is the harmonic mean of precision and recall, and which gives some idea of "performance" when you have to trade off precision against recall.
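In formula form, F1 = 2 * Precision * Recall / (Precision + Recall); on the exact-match numbers above (Precision = 1/4, Recall = 1/3) that comes to 2/7, roughly 0.29.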