Convert the Brown corpus tagset to upenn tagset

Question

I want to find words in the Brown corpus with a certain upenn tag. I've tried doing this using the following code:

import nltk

# Penn Treebank (upenn) tags I want to look up
poss = ['TO', 'NNS', 'RB', 'DT', 'VBD', 'JJ', 'RBS',
        'PDT', 'IN', 'VBN', 'RP', 'NN', 'VB', 'CC',
        'JJS', 'VBG', 'WRB', 'PRP$', 'WP$', 'WP',
        'EX', 'CD', 'JJR', 'VBZ', 'MD', 'VBP', 'WDT', 'PRP', 'RBR']

PARTS_OF_SPEECH = {p: set() for p in poss}

# Unique (word, tag) pairs; the tags here are Brown tags
words = set(nltk.corpus.brown.tagged_words())

for word, tag in words:
    if tag in PARTS_OF_SPEECH:
        PARTS_OF_SPEECH[tag].add(word)

so I can do PARTS_OF_SPEECH["NN"] to get all the words in Brown with the upenn tag "NN".

Unfortunately, this doesn't work, because brown.tagged_words() returns words tagged with Brown tags rather than upenn tags, which are slightly different. I know that there is a tagset keyword argument to brown.tagged_words(), but I can't find any arguments it takes other than "universal", which isn't what I want. Is there some argument that returns upenn tags?

You probably have to create a mapping from Brown to upenn tags, or possibly find an existing one. This won't be perfect, though, since tagsets are not easily exchanged. Each tagset is based on separate annotation guidelines, inevitably including differences in the underlying linguistic theory. — lenz
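A minimal sketch of the mapping approach suggested in the comment above. The table `BROWN_TO_PENN`, the set `SAME_IN_BOTH`, and the helpers `brown_to_penn` and `group_by_penn_tag` are hypothetical names made up for illustration; they are not part of NLTK. The mapping is deliberately incomplete and lossy (for example, Brown `VB` covers what upenn splits into `VB` and `VBP`), so treat it as a starting point rather than a faithful conversion.

```python
from collections import defaultdict

# Hypothetical, incomplete table: Brown tags whose upenn counterpart has a different name.
BROWN_TO_PENN = {
    "AT": "DT",                                  # articles
    "NP": "NNP", "NPS": "NNPS",                  # proper nouns
    "JJT": "JJS", "RBT": "RBS",                  # superlatives
    "PPS": "PRP", "PPSS": "PRP", "PPO": "PRP",   # personal pronouns
    "PP$": "PRP$",                               # possessive pronouns
    "WPS": "WP", "WPO": "WP",                    # wh-pronouns
    "ABN": "PDT", "ABX": "PDT",                  # pre-quantifiers such as "all", "both"
    "CS": "IN",                                  # subordinating conjunctions
}

# Tags assumed to carry over under the same name (an approximation; "VB" is lossy).
SAME_IN_BOTH = {
    "NN", "NNS", "VB", "VBD", "VBG", "VBN", "VBZ", "JJ", "JJR", "RB", "RBR",
    "RP", "CC", "CD", "MD", "TO", "EX", "IN", "DT", "UH", "WDT", "WRB", "WP$",
}

# Brown appends these markers for titles, headlines, and cited words.
SUFFIXES = ("-TL", "-HL", "-NC")


def brown_to_penn(brown_tag):
    """Return an approximate upenn tag, or None if the sketch has no mapping."""
    base = brown_tag
    while base.endswith(SUFFIXES):
        base = base.rsplit("-", 1)[0]
    if base in SAME_IN_BOTH:
        return base
    return BROWN_TO_PENN.get(base)


def group_by_penn_tag(tagged_words):
    """Group words by approximate upenn tag, skipping tags that cannot be mapped."""
    grouped = defaultdict(set)
    for word, brown_tag in tagged_words:
        penn_tag = brown_to_penn(brown_tag)
        if penn_tag is not None:
            grouped[penn_tag].add(word)
    return dict(grouped)


# Small hand-written sample in the (word, Brown tag) shape the corpus reader yields.
sample = [("The", "AT"), ("Fulton", "NP-TL"), ("dogs", "NNS"), ("best", "JJT")]
PARTS_OF_SPEECH = group_by_penn_tag(sample)
```

With the corpus installed, passing `nltk.corpus.brown.tagged_words()` instead of `sample` should fill `PARTS_OF_SPEECH` the same way; any Brown tag missing from the two tables is silently dropped, so the table would need extending to get fuller coverage.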

rlms · Accepted Answer · 2015-08-15T20:45:53

Currently, this doesn't seem to be possible (see this issue). Workarounds using third-party tools such as this might work.
