I've been using the NLTK Unigram tagger with the model keyword to pass in a list of words for specific tagging:
nd = dict((x,'CFN') for x in common_first_names)
...
t4 = nltk.UnigramTagger(model=nd, backoff=t3)
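To be concrete, the behavior I get from the model keyword is roughly this (a plain-Python sketch of the lookup-with-backoff mechanism, not NLTK's actual implementation; the names and data are made up):

```python
# Sketch of dict-based unigram tagging with a backoff tag
# (illustration only -- not NLTK's implementation).
common_first_names = ["alice", "bob"]  # hypothetical data
name_model = {name: 'CFN' for name in common_first_names}

def unigram_tag(tokens, model, backoff_tag='NN'):
    """Tag each token by dictionary lookup, falling back to a default tag."""
    return [(tok, model.get(tok, backoff_tag)) for tok in tokens]

print(unigram_tag(["alice", "went", "home"], name_model))
# [('alice', 'CFN'), ('went', 'NN'), ('home', 'NN')]
```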
I have very specific information I want to extract from my documents, and a wide range of documents with very different punctuation, capitalization, and grammatical quality, so training on a pre-existing corpus hasn't proven very successful. Instead, I've been doing my own tagging as shown above, along with Regexp and Default taggers, to tag things exactly as I want. I'd like to use the Bigram and Trigram taggers in a similar way, passing in a model of word combinations so that the last word in a sequence gets tagged according to the words that precede it, something like:
# 'the' gets different tag depending on preceding word
{
('for','the') : 'FT',
('into','the') : 'IT',
('on','the') : 'OT'
}
But I discovered the hard way (code reading, debugging, and finally re-reading the book, where it is stated clearly) that Ngram taggers use tags, not tokens, as left context. Since 'for', 'into', and 'on' would likely all be tagged the same way, this gives me no way to distinguish between them. The reliance on tags also makes Ngram taggers of little use without a large and relevant training set, since they fail as soon as they see an untagged word, or a word tagged in a way that didn't appear in the training data.
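To make clear what I'm after, the behavior I want is roughly this (a hand-rolled sketch, not an existing NLTK tagger): the previous *token* is the context, not its tag.

```python
# Hand-rolled sketch of token-context bigram tagging
# (illustration only -- not an existing NLTK tagger).
bigram_model = {
    ('for', 'the'): 'FT',
    ('into', 'the'): 'IT',
    ('on', 'the'): 'OT',
}

def token_bigram_tag(tokens, model, backoff_tag='UNK'):
    """Tag each token using the previous *token* (not its tag) as context."""
    tagged = []
    for i, tok in enumerate(tokens):
        prev = tokens[i - 1] if i > 0 else None
        tagged.append((tok, model.get((prev, tok), backoff_tag)))
    return tagged

print(token_bigram_tag(['looking', 'for', 'the', 'answer'], bigram_model))
# [('looking', 'UNK'), ('for', 'UNK'), ('the', 'FT'), ('answer', 'UNK')]
```

Here 'the' gets the tag 'FT' only because the literal token 'for' precedes it, which is exactly the distinction the tag-based context loses.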
I've done a fair amount of searching and haven't found any discussion of this anywhere. Every discussion of Ngram taggers beyond the Unigram tagger seems to expect training data rather than a model. Is there any way to tag with tokens as context, rather than tags? Thanks