
I've been using the NLTK Unigram tagger with the model keyword to pass in a list of words for specific tagging:

# map each known first name to a custom tag
nd = dict((x, 'CFN') for x in common_first_names)
...
t4 = nltk.UnigramTagger(model=nd, backoff=t3)

I have very specific information I want to extract from my documents, and a wide range of documents with very different punctuation, capitalization, and grammatical quality, so training on a pre-existing corpus hasn't proven very successful. Instead, I've been doing my own tagging as shown above, along with RegExp and Default taggers, to tag things exactly as I want. I wanted to use Bigram and Trigram taggers in the same way, passing in a model of word combinations so that the last word in a sequence gets tagged depending on the words that precede it, something like:

# 'the' gets a different tag depending on the preceding word
{
    ('for', 'the'): 'FT',
    ('into', 'the'): 'IT',
    ('on', 'the'): 'OT'
}

But I discovered the hard way (code reading, debugging, and finally re-reading the book, where it is stated clearly) that Ngram taggers use tags, not tokens, as left context. Since 'for', 'into', and 'on' would likely all be tagged the same way, this gives me no way to distinguish between them. The reliance on tags also makes Ngram taggers of limited use unless you have a large and relevant training set, since they break as soon as they see an untagged word or a word tagged in a way that doesn't appear in the training data.
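To see the failure concretely, here is a minimal plain-Python sketch (no NLTK required; `token_model` and the helper function are made up for illustration) of what a tag-context bigram lookup does to a model keyed on token pairs:

```python
# Hypothetical sketch of a bigram lookup that uses the previous *tag*
# as left context, the way NLTK's Ngram taggers do.
token_model = {
    ('for', 'the'): 'FT',
    ('into', 'the'): 'IT',
    ('on', 'the'): 'OT',
}

def tag_with_tag_context(tokens, unigram_tags):
    """Tag each token using (previous_tag, token) as the lookup key."""
    history, result = [], []
    for i, tok in enumerate(tokens):
        prev = history[i - 1] if i > 0 else None
        # token_model is keyed on (word, word) pairs, but prev is a tag
        # such as 'IN', so this lookup can never succeed
        tag = token_model.get((prev, tok), unigram_tags.get(tok, 'NA'))
        history.append(tag)
        result.append((tok, tag))
    return result

# 'for', 'into', and 'on' all collapse to the same tag 'IN', so the
# token identity is gone before 'the' is ever tagged:
unigram = {'for': 'IN', 'into': 'IN', 'on': 'IN'}
print(tag_with_tag_context(['into', 'the'], unigram))
# -> [('into', 'IN'), ('the', 'NA')], never 'IT'
```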

I've done a fair amount of searching and haven't found any discussion of this anywhere. Every discussion of Ngram taggers beyond Unigram taggers seems to expect training data rather than a model. Is there any way to tag with tokens as context instead of tags? Thanks


1 Answer


I think I managed to come up with a solution, though it was a guess arrived at after extensive code inspection. I created my own Ngram tagger as a subclass of the NLTK NgramTagger class, as follows:

class myNgramTagger(nltk.NgramTagger):
    """
    My override of the NLTK NgramTagger class that considers previous
    tokens rather than previous tags for context.
    """
    def __init__(self, n, train=None, model=None,
                 backoff=None, cutoff=0, verbose=False):
        nltk.NgramTagger.__init__(self, n, train, model, backoff, cutoff, verbose)

    def context(self, tokens, index, history):
        # original: tag_context = tuple(history[max(0, index - self._n + 1):index])
        tag_context = tuple(tokens[max(0, index - self._n + 1):index])
        return tag_context, tokens[index]

The only line I changed is the commented one in the context method, where I swapped the history list (previous tags) for the tokens list (previous words). It was largely a guess that this would do what I wanted, but it seems to work with both a model and training data.
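To make the effect of that one-line change explicit, here is a plain-Python sketch (no NLTK needed; the variable values are made up) of what each version of the context method returns for a bigram tagger (n=2) while tagging the third word:

```python
n = 2
tokens = ['When', 'a', 'small']
history = ['NA', 'XX']        # tags already assigned to tokens[0:2]
index = 2                     # we are currently tagging 'small'

# original NLTK behaviour: left context is the previous *tags*
tag_context = tuple(history[max(0, index - n + 1):index])
print(tag_context, tokens[index])    # ('XX',) small

# overridden behaviour: left context is the previous *tokens*
token_context = tuple(tokens[max(0, index - n + 1):index])
print(token_context, tokens[index])  # ('a',) small
```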

test_sent = ["When", "a", "small", "plane", "crashed", "into", "the",
             "river", "a", "general", "alert", "was", "a", "given"]

tm2 = {
    (('When',), 'a'): "XX",
    (('into',), 'the'): "YY",
}

tm3 = {
    (('a', 'general'), 'alert'): "ZZ",
}

taggerd = nltk.DefaultTagger('NA')
tagger2w = myNgramTagger(2, model=tm2, backoff=taggerd)
tagger3w = myNgramTagger(3, model=tm3, backoff=tagger2w)
print(tagger3w.tag(test_sent))

[('When', 'NA'), ('a', 'XX'), ('small', 'NA'), ('plane', 'NA'), ('crashed', 'NA'), ('into', 'NA'), ('the', 'YY'), ('river', 'NA'), ('a', 'NA'), ('general', 'NA'), ('alert', 'ZZ'), ('was', 'NA'), ('a', 'NA'), ('given', 'NA')]

So by changing one word in one method, I seem to have gotten what I wanted: Ngram tagging using tokens as context rather than tags.
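As a sanity check, the trigram-to-bigram-to-default backoff chain above can be re-created in plain Python with dictionary lookups (no NLTK; `tag` below is a made-up helper, not part of NLTK):

```python
test_sent = ["When", "a", "small", "plane", "crashed", "into", "the",
             "river", "a", "general", "alert", "was", "a", "given"]

tm2 = {(('When',), 'a'): 'XX', (('into',), 'the'): 'YY'}
tm3 = {(('a', 'general'), 'alert'): 'ZZ'}

def tag(tokens):
    """Token-context lookup: trigram model, then bigram, then 'NA'."""
    out = []
    for i, tok in enumerate(tokens):
        ctx3 = (tuple(tokens[max(0, i - 2):i]), tok)  # previous 2 tokens
        ctx2 = (tuple(tokens[max(0, i - 1):i]), tok)  # previous 1 token
        out.append((tok, tm3.get(ctx3) or tm2.get(ctx2) or 'NA'))
    return out

print(tag(test_sent))  # reproduces the tagging shown above
```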

I also tried something similar using the Brown corpus with the news category for training (hence my choice of test sentence), and it seemed to work fine. In fact, it worked better than with tags, since it managed to tag everything in the sentence it could recognize rather than stopping short once it saw something it didn't recognize:

from nltk.corpus import brown

brown_tagged_sents = brown.tagged_sents(categories='news')
brown_tagger_bigram = myNgramTagger(2, brown_tagged_sents)
brown_tagger_trigram = myNgramTagger(3, brown_tagged_sents, backoff=brown_tagger_bigram)
print(brown_tagger_trigram.tag(test_sent))

[('When', u'WRB'), ('a', u'AT'), ('small', u'JJ'), ('plane', None), ('crashed', None), ('into', None), ('the', u'AT'), ('river', None), ('a', None), ('general', u'JJ'), ('alert', None), ('was', None), ('a', u'AT'), ('given', u'VBN')]

Comparing this to the stock NLTK Ngram tagger shows it really is an improvement:

from nltk.corpus import brown

brown_tagged_sents = brown.tagged_sents(categories='news')
brown_tagger_bigram = nltk.NgramTagger(2, brown_tagged_sents)
brown_tagger_trigram = nltk.NgramTagger(3, brown_tagged_sents, backoff=brown_tagger_bigram)
print(brown_tagger_trigram.tag(test_sent))

[('When', u'WRB'), ('a', u'AT'), ('small', u'JJ'), ('plane', None), ('crashed', None), ('into', None), ('the', None), ('river', None), ('a', None), ('general', None), ('alert', None), ('was', None), ('a', None), ('given', None)]

Tagging with token context gives decent results all the way to the end of the sentence, whereas tagging with tag context stops working after the third word.