How to calculate tag-wise precision and recall for POS tagger?

Question

I am using some rule-based and statistical POS taggers to tag a corpus (of around 5000 sentences) with parts of speech (POS). Below is a snippet of my test corpus, where each word is separated from its POS tag by '/'.

No/RB ,/, it/PRP was/VBD n't/RB Black/NNP Monday/NNP ./.
But/CC while/IN the/DT New/NNP York/NNP Stock/NNP Exchange/NNP did/VBD n't/RB fall/VB apart/RB Friday/NNP as/IN the/DT Dow/NNP Jones/NNP Industrial/NNP Average/NNP plunged/VBD 190.58/CD points/NNS --/: most/JJS of/IN it/PRP in/IN the/DT final/JJ hour/NN --/: it/PRP barely/RB managed/VBD *-2/-NONE- to/TO stay/VB this/DT side/NN of/IN chaos/NN ./.
Some/DT ``/`` circuit/NN breakers/NNS ''/'' installed/VBN */-NONE- after/IN the/DT October/NNP 1987/CD crash/NN failed/VBD their/PRP$ first/JJ test/NN ,/, traders/NNS say/VBP 0/-NONE- *T*-1/-NONE- ,/, *-2/-NONE- unable/JJ *-3/-NONE- to/TO cool/VB the/DT selling/NN panic/NN in/IN both/DT stocks/NNS and/CC futures/NNS ./.

After tagging the corpus, it looks like this:

No/DT ,/, it/PRP was/VBD n't/RB Black/NNP Monday/NNP ./. 
But/CC while/IN the/DT New/NNP York/NNP Stock/NNP Exchange/NNP did/VBD n't/RB fall/VB apart/RB Friday/VB as/IN the/DT Dow/NNP Jones/NNP Industrial/NNP Average/JJ plunged/VBN 190.58/CD points/NNS --/: most/RBS of/IN it/PRP in/IN the/DT final/JJ hour/NN --/: it/PRP barely/RB managed/VBD *-2/-NONE- to/TO stay/VB this/DT side/NN of/IN chaos/NNS ./. 
Some/DT ``/`` circuit/NN breakers/NNS ''/'' installed/VBN */-NONE- after/IN the/DT October/NNP 1987/CD crash/NN failed/VBD their/PRP$ first/JJ test/NN ,/, traders/NNS say/VB 0/-NONE- *T*-1/-NONE- ,/, *-2/-NONE- unable/JJ *-3/-NONE- to/TO cool/VB the/DT selling/VBG panic/NN in/IN both/DT stocks/NNS and/CC futures/NNS ./.

I need to calculate the tagging accuracy (tag-wise recall and precision), so I need to find any tagging errors for each word-tag pair.

The approach I am thinking of is to loop through these two text files, store their contents in two lists, and then compare the two lists element by element.

This approach seems really crude to me, so I would like you guys to suggest a better solution to the above problem.

From the Wikipedia page:

In a classification task, the precision for a class is the number of true positives (i.e. the number of items correctly labeled as belonging to the positive class) divided by the total number of elements labeled as belonging to the positive class (i.e. the sum of true positives and false positives, which are items incorrectly labeled as belonging to the class). Recall in this context is defined as the number of true positives divided by the total number of elements that actually belong to the positive class (i.e. the sum of true positives and false negatives, which are items which were not labeled as belonging to the positive class but should have been).

Tommy Herbert · Accepted Answer · 2011-03-10T23:01:47

Note that since every word has exactly one tag, overall recall and precision scores are meaningless for this task (they'll both just equal the accuracy measure). But it does make sense to ask for recall and precision measures per tag - for example, you could find the recall and precision for the DT tag.
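To see why the overall scores collapse into accuracy, here is a small sketch with made-up tags (not taken from the corpus in the question). It assumes "overall" means pooling the counts over all tags, i.e. micro-averaging; the variable names are just for illustration.

```python
from collections import Counter

# Toy gold-standard and predicted tags, invented for illustration.
gold = ['DT', 'NN', 'VBD', 'NNP', 'NN']
pred = ['DT', 'NNS', 'VBD', 'JJ', 'NN']

tp, fp, fn = Counter(), Counter(), Counter()
for g, p in zip(gold, pred):
    if g == p:
        tp[g] += 1
    else:
        fn[g] += 1  # the true tag was missed
        fp[p] += 1  # the guessed tag was wrongly assigned

total_tp = sum(tp.values())
accuracy = total_tp / len(gold)
# Pooled (micro-averaged) precision and recall over all tags.
overall_precision = total_tp / (total_tp + sum(fp.values()))
overall_recall = total_tp / (total_tp + sum(fn.values()))

print(accuracy, overall_precision, overall_recall)
```

Every mistake adds exactly one false negative (for the true tag) and one false positive (for the guessed tag), so the pooled false-positive and false-negative counts are always equal, and both denominators come out as the total number of words.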

The most efficient way to do this for all tags at once is similar to the way you suggested, though you can save one pass over the data by skipping the list-making stage. Read in a line of each file, compare the two lines word by word, and repeat until you reach the end of the files. For each word comparison, you probably want to check the words are equal for sanity, rather than assuming the two files are in sync. For each kind of tag, you keep three running totals: true positives, false positives and false negatives. If the two tags for the current word match, increment the true positive total for the tag. If they don't match, you need to increment the false negative total for the true tag and the false positive total for the tag your machine mistakenly chose. At the end, you can calculate recall and precision scores for each tag by following the formula in your Wikipedia excerpt.

I haven't tested this code and my Python's a bit rusty, but this should give you the idea. I'm assuming the files are open and the totals data structure is a dictionary of dictionaries:

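from collections import defaultdict

# Assumes testFile (gold standard) and taggedFile (tagger output) are already open.
# totals maps each tag to its three running counts.
totals = defaultdict(lambda: {'truePositives': 0, 'falsePositives': 0, 'falseNegatives': 0})

finished = False
while not finished:
    trueLine = testFile.readline()
    if not trueLine: # end of file
        finished = True
    else:
        trueLine = trueLine.split() # tokenise by whitespace
        taggedLine = taggedFile.readline()
        if not taggedLine: # the tagged file ended early
            raise ValueError('Error: files are out of sync.')
        taggedLine = taggedLine.split()
        if len(trueLine) != len(taggedLine):
            raise ValueError('Error: files are out of sync.')
        for i in range(len(trueLine)):
            # split on the last '/' only, so words that contain '/' stay intact
            truePair = trueLine[i].rsplit('/', 1)
            taggedPair = taggedLine[i].rsplit('/', 1)
            if truePair[0] != taggedPair[0]: # the words should match
                raise ValueError('Error: files are out of sync.')
            trueTag = truePair[1]
            guessedTag = taggedPair[1]
            if trueTag == guessedTag:
                totals[trueTag]['truePositives'] += 1
            else:
                totals[trueTag]['falseNegatives'] += 1
                totals[guessedTag]['falsePositives'] += 1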
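
The last step, turning the three running totals into per-tag scores, could look something like the sketch below. It assumes `totals` has the dictionary-of-dictionaries shape used in the code above; `precision_recall` and `example_totals` are names made up here, and the counts are invented for illustration rather than computed from the corpus in the question.

```python
def precision_recall(totals):
    """Return {tag: (precision, recall)} from per-tag running totals."""
    scores = {}
    for tag, counts in totals.items():
        tp = counts['truePositives']
        fp = counts['falsePositives']
        fn = counts['falseNegatives']
        # Precision: of the words given this tag by the tagger, the share that were right.
        # None means the tagger never chose this tag, so precision is undefined.
        precision = tp / (tp + fp) if tp + fp else None
        # Recall: of the words that truly carry this tag, the share the tagger found.
        # None means the tag never occurs in the gold standard.
        recall = tp / (tp + fn) if tp + fn else None
        scores[tag] = (precision, recall)
    return scores


# Hypothetical counts, for illustration only.
example_totals = {
    'DT': {'truePositives': 8, 'falsePositives': 2, 'falseNegatives': 0},
    'VB': {'truePositives': 3, 'falsePositives': 1, 'falseNegatives': 1},
}

for tag, (precision, recall) in sorted(precision_recall(example_totals).items()):
    print(tag, precision, recall)
```

Returning `None` for an undefined score is one possible choice; reporting 0.0 instead would also be reasonable, depending on how you want to present tags that never appear.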
