I want to find all the words in the Brown corpus that have a given upenn (Penn Treebank) tag. I've tried doing this with the following code:
import nltk

# upenn tags I want to collect words for
poss = ['TO', 'NNS', 'RB', 'DT', 'VBD', 'JJ', 'RBS',
        'PDT', 'IN', 'VBN', 'RP', 'NN', 'VB', 'CC',
        'JJS', 'VBG', 'WRB', 'PRP$', 'WP$', 'WP',
        'EX', 'CD', 'JJR', 'VBZ', 'MD', 'VBP', 'WDT', 'PRP', 'RBR']
# map each tag to the set of words seen with it
PARTS_OF_SPEECH = {p: set() for p in poss}
# unique (word, tag) pairs from the corpus
words = set(nltk.corpus.brown.tagged_words())
for word, tag in words:
    if tag in PARTS_OF_SPEECH:
        PARTS_OF_SPEECH[tag].add(word)
so I can do PARTS_OF_SPEECH["NN"] to get all the words in Brown with the upenn tag "NN".
Unfortunately, this doesn't work, because brown.tagged_words() returns words with Brown tags rather than upenn tags, and the two tagsets differ slightly. I know that brown.tagged_words() takes a tagset keyword argument, but the only value I can find for it is "universal", which isn't what I want. Is there a value that returns upenn tags?
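To make the mismatch concrete, here is a minimal sketch that doesn't need the corpus at all. The sample pairs are hand-written in the style of Brown tags (an assumption for illustration, not read from brown.tagged_words()), and group_by_tag is just a hypothetical helper that repeats the loop from my code:

```python
# Hand-written (word, tag) pairs in the style of Brown tags.
# This is an illustrative assumption, not output from the corpus.
brown_style_sample = [
    ("The", "AT"), ("Fulton", "NP-TL"), ("County", "NN-TL"),
    ("said", "VBD"), ("an", "AT"), ("investigation", "NN"), ("of", "IN"),
]

# A few of the upenn tags from my list.
upenn_tags = ["DT", "NN", "VBD", "IN"]


def group_by_tag(tagged_words, tags):
    """Collect the words seen with each tag in `tags`; other tags are skipped."""
    groups = {tag: set() for tag in tags}
    for word, tag in tagged_words:
        if tag in groups:
            groups[tag].add(word)
    return groups


groups = group_by_tag(brown_style_sample, upenn_tags)
print(groups["NN"])  # tags that exist in both tagsets still match
print(groups["DT"])  # empty set: articles carry the Brown tag AT, not DT
```

So tags that happen to be spelled the same in both tagsets get filled in, while ones like DT silently stay empty, which is what I'm seeing with the real corpus.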