33
votes

Now I am about to report results from Named Entity Recognition. One thing I find a bit confusing is that my understanding of precision and recall was that one simply sums up the true positives, true negatives, false positives and false negatives over all classes.

But this seems implausible now that I think about it, as each misclassification would simultaneously give rise to one false positive and one false negative (e.g. a token that should have been labelled as "A" but was labelled as "B" is a false negative for "A" and a false positive for "B"). Thus the number of false positives and the number of false negatives summed over all classes would be the same, which means that precision is (always!) equal to recall. This simply can't be true, so there is an error in my reasoning and I wonder where it is. It is certainly something quite obvious and straightforward, but it escapes me right now.


6 Answers

51
votes

The way precision and recall are typically computed (this is what I use in my papers) is to measure entities against each other. Supposing the ground truth has the following (without any differentiation as to what type of entities they are):

[Microsoft Corp.] CEO [Steve Ballmer] announced the release of [Windows 7] today

This has 3 entities.

Supposing your actual extraction has the following:

[Microsoft Corp.] [CEO] [Steve] Ballmer announced the release of Windows 7 [today]

You have an exact match for Microsoft Corp., false positives for CEO and today, a false negative for Windows 7, and a substring match for Steve.

We compute precision and recall by first defining matching criteria. For example, do they have to be an exact match? Is it a match if they overlap at all? Do entity types matter? Typically we want to provide precision and recall for several of these criteria.

Exact match: True Positives = 1 (Microsoft Corp., the only exact match), False Positives = 3 (CEO, today, and Steve, which isn't an exact match), False Negatives = 2 (Steve Ballmer and Windows 7)

Precision = True Positives / (True Positives + False Positives) = 1/(1+3) = 0.25
Recall = True Positives / (True Positives + False Negatives) = 1/(1+2) = 0.33

Any Overlap OK: True Positives = 2 (Microsoft Corp., and Steve, which overlaps Steve Ballmer), False Positives = 2 (CEO and today), False Negatives = 1 (Windows 7)

Precision = True Positives / (True Positives + False Positives) = 2/(2+2) = 0.50
Recall = True Positives / (True Positives + False Negatives) = 2/(2+1) = 0.66
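
If you want to reproduce these counts programmatically, a minimal Python sketch along those lines could look like the following (the character-span representation and the overlaps test are just one reasonable way to encode the matching criteria, not the only one):

```python
text = ("Microsoft Corp. CEO Steve Ballmer announced the release of "
        "Windows 7 today")

def span(surface):
    # Character span of a surface string in the sentence (each occurs once here)
    start = text.index(surface)
    return (start, start + len(surface))

gold = {span(s) for s in ("Microsoft Corp.", "Steve Ballmer", "Windows 7")}
pred = {span(s) for s in ("Microsoft Corp.", "CEO", "Steve", "today")}

def overlaps(a, b):
    # Half-open spans overlap unless one ends before the other starts
    return a[0] < b[1] and b[0] < a[1]

# Exact match: a predicted span must equal a gold span exactly
tp = len(pred & gold)
fp = len(pred - gold)
fn = len(gold - pred)
print("exact   P=%.2f R=%.2f" % (tp / (tp + fp), tp / (tp + fn)))   # 0.25, 0.33

# Any overlap: a predicted span counts if it overlaps some gold span
tp = sum(any(overlaps(p, g) for g in gold) for p in pred)
fp = len(pred) - tp
fn = sum(all(not overlaps(g, p) for p in pred) for g in gold)
print("overlap P=%.2f R=%.2f" % (tp / (tp + fp), tp / (tp + fn)))   # 0.50, 0.67
```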

The reader is then left to infer that the "real performance" (the precision and recall that an unbiased human checker would give when allowed to use human judgement to decide which overlap discrepancies are significant, and which are not) is somewhere between the two.

It's also often useful to report the F1 measure, which is the harmonic mean of precision and recall, and which gives some idea of "performance" when you have to trade off precision against recall.
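
For the worked example above, using the exact fractions (precision 1/4 and recall 1/3 for exact match, 1/2 and 2/3 for any overlap):

F1 = 2 * Precision * Recall / (Precision + Recall)

Exact match: F1 = 2 * (1/4) * (1/3) / ((1/4) + (1/3)) ≈ 0.29
Any overlap: F1 = 2 * (1/2) * (2/3) / ((1/2) + (2/3)) ≈ 0.57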

13
votes

In the CoNLL-2003 NER task, the evaluation was based on correctly marked entities, not tokens, as described in the paper 'Introduction to the CoNLL-2003 Shared Task: Language-Independent Named Entity Recognition'. An entity is correctly marked if the system identifies an entity of the correct type with the correct start and end point in the document. I prefer this approach in evaluation because it's closer to a measure of performance on the actual task; a user of the NER system cares about entities, not individual tokens.

However, the problem you described still exists. If you mark an entity of type ORG with type LOC you incur a false positive for LOC and a false negative for ORG. There is an interesting discussion on the problem in this blog post.
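
A minimal sketch of that entity-level, per-type bookkeeping (the entity tuples below are invented for illustration, not taken from the CoNLL data):

```python
from collections import Counter

# Entities as (start_token, end_token, type); these spans are made up
gold = {(0, 2, "ORG"), (3, 5, "PER")}
pred = {(0, 2, "LOC"), (3, 5, "PER")}   # the ORG entity was mislabelled as LOC

tp, fp, fn = Counter(), Counter(), Counter()
for ent in pred:
    if ent in gold:
        tp[ent[2]] += 1          # correct type and boundaries
    else:
        fp[ent[2]] += 1          # counts against the predicted type
for ent in gold:
    if ent not in pred:
        fn[ent[2]] += 1          # counts against the gold type

print(tp, fp, fn)
# tp: {'PER': 1}, fp: {'LOC': 1}, fn: {'ORG': 1} -- one error, counted twice
```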

3
votes

As mentioned before, there are different ways of measuring NER performance. It is possible to evaluate separately how precisely entities are detected in terms of their position in the text and in terms of their class (person, location, organization, etc.), or to combine both aspects in a single measure.

You'll find a nice review in the following thesis: D. Nadeau, Semi-Supervised Named Entity Recognition: Learning to Recognize 100 Entity Types with Little Supervision (2007). Have a look at section 2.6. Evaluation of NER.

2
votes

There is no simple right answer to this question. There are a variety of different ways to count errors; the MUC competitions used one, and other people have used others.

However, to help you with your immediate confusion:

You have a set of tags, no? Something like NONE, PERSON, ANIMAL, VEGETABLE?

If a token should be PERSON, and you tag it NONE, then that's a false positive for NONE and a false negative for PERSON. If a token should be NONE and you tag it PERSON, it's the other way around.

So you get a score for each entity type.

You can also aggregate those scores.
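
A rough sketch of that per-tag bookkeeping and one way to aggregate it (the tag pairs below are invented for illustration):

```python
from collections import Counter

# Hypothetical (gold_tag, predicted_tag) pairs, one per token
pairs = [("PERSON", "PERSON"), ("PERSON", "NONE"), ("NONE", "NONE"),
         ("ANIMAL", "ANIMAL"), ("NONE", "ANIMAL"), ("VEGETABLE", "VEGETABLE")]

tp, fp, fn = Counter(), Counter(), Counter()
for gold_tag, pred_tag in pairs:
    if gold_tag == pred_tag:
        tp[gold_tag] += 1
    else:
        fn[gold_tag] += 1     # the gold tag was missed
        fp[pred_tag] += 1     # the predicted tag was spurious

# Per-tag precision and recall
for tag in sorted(set(tp) | set(fp) | set(fn)):
    precision = tp[tag] / (tp[tag] + fp[tag]) if tp[tag] + fp[tag] else 0.0
    recall = tp[tag] / (tp[tag] + fn[tag]) if tp[tag] + fn[tag] else 0.0
    print(tag, round(precision, 2), round(recall, 2))

# One way to aggregate: pool the counts over all tags (micro-average)
micro_p = sum(tp.values()) / (sum(tp.values()) + sum(fp.values()))
micro_r = sum(tp.values()) / (sum(tp.values()) + sum(fn.values()))
```

Note that the pooled (micro-averaged) precision and recall coincide here, since every token gets exactly one gold tag and one predicted tag; that is exactly the effect described in the question. The per-tag scores do not coincide.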

2
votes

Just to be clear, these are the definitions:

Precision = TP/(TP+FP) = What portion of what you found was ground truth?

Recall = TP/(TP+FN) = What portion of the ground truth did you recover?

They won't necessarily always be equal, since the number of false negatives will not necessarily equal the number of false positives.

If I understand your problem right, you're assigning each token to one of more than two possible labels. In order for precision and recall to make sense, you need a binary classifier. So you could use precision and recall if you phrase the classification as whether a token is in group "A" or not, and then repeat for each group. In this case a missed classification counts twice: as a false negative for one group and as a false positive for another.

If you're doing a classification like this where it isn't binary (assigning each token to a group), it might be useful instead to look at pairs of tokens. Phrase your problem as "Are tokens X and Y in the same classification group?". This allows you to compute precision and recall over all pairs of tokens. This isn't as appropriate if your classification groups are labelled or have associated meanings. For example, if your classification groups are "Fruits" and "Vegetables", and you classify both "Apples" and "Oranges" as "Vegetables", then this algorithm would score it as a true positive even though the wrong group was assigned. But if your groups are unlabelled, for example "A" and "B", then if apples and oranges were both classified as "A", afterward you could say that "A" corresponds to "Fruits". A sketch of this pairwise scoring follows below.
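
The sketch (the tokens and groupings below are invented to mirror the fruit/vegetable example):

```python
from itertools import combinations

# Hypothetical gold classes and predicted (unlabelled) groups per token
gold = {"Apples": "Fruits", "Oranges": "Fruits",
        "Carrots": "Vegetables", "Kale": "Vegetables"}
pred = {"Apples": "A", "Oranges": "A", "Carrots": "A", "Kale": "B"}

tp = fp = fn = 0
for x, y in combinations(gold, 2):
    same_gold = gold[x] == gold[y]
    same_pred = pred[x] == pred[y]
    if same_gold and same_pred:
        tp += 1        # pair correctly placed in the same group
    elif same_pred and not same_gold:
        fp += 1        # pair wrongly merged
    elif same_gold and not same_pred:
        fn += 1        # pair wrongly split

precision = tp / (tp + fp)   # 1/3
recall = tp / (tp + fn)      # 1/2
```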

0
votes

If you are training a spaCy NER model, its scorer.py API gives you the precision, recall and F-score of your NER model.

The code and output would be in the following format:


For anyone with the same question, the scorer is at the following link:

spaCy/scorer.py

```python
import spacy
from spacy.gold import GoldParse
from spacy.scorer import Scorer

def evaluate(ner_model, examples):
    # Score the model's predictions against the gold annotations
    scorer = Scorer()
    for input_, annot in examples:
        doc_gold_text = ner_model.make_doc(input_)
        gold = GoldParse(doc_gold_text, entities=annot)
        pred_value = ner_model(input_)
        scorer.score(pred_value, gold)
    return scorer.scores
```

Example run:

```python
examples = [
    ('Who is Shaka Khan?', [(7, 17, 'PERSON')]),
    ('I like London and Berlin.', [(7, 13, 'LOC'), (18, 24, 'LOC')]),
]

ner_model = spacy.load(ner_model_path)  # for spaCy's pretrained model use 'en_core_web_sm'
results = evaluate(ner_model, examples)
```

The output will be in a format like:

{'uas': 0.0, 'las': 0.0, 'ents_p': 43.75, 'ents_r': 35.59322033898305, 'ents_f': 39.252336448598136, 'tags_acc': 0.0, 'token_acc': 100.0}
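
The snippet above targets spaCy v2, where GoldParse still exists. As a rough sketch for spaCy v3 (assuming version 3.0 or later), the same evaluation would go through spacy.training.Example and nlp.evaluate instead:

```python
import spacy
from spacy.training import Example

def evaluate_v3(nlp, examples):
    # Pair each raw text with its gold entity offsets
    scorable = [
        Example.from_dict(nlp.make_doc(text), {"entities": annotations})
        for text, annotations in examples
    ]
    # nlp.evaluate runs the pipeline over the texts and returns a dict that
    # includes ents_p, ents_r and ents_f (plus per-type scores under ents_per_type)
    return nlp.evaluate(scorable)

nlp = spacy.load("en_core_web_sm")
print(evaluate_v3(nlp, examples))   # reuses the `examples` list from above
```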