0
votes

I have computed simplified tags from the output of NLTK's POS tagger:

simplifiedTags = [(word, simplify_wsj_tag(tag)) for word, tag in posTagged]
print(simplifiedTags)
#[('And', 'CONJ'), ('now', 'ADV'), ('for', 'ADP'), ('something', 'NOUN'), ('completely', 'ADV'), ('different', 'ADJ')]

Now the lemma of each word has to be found. Each of these tags, except conjunction, can be mapped to a WordNet POS class: noun, adjective, adverb, or verb. What should be done with the words labelled as conjunctions? Which of the four classes is the closest relative of the conjunction? Or should these words be dropped from the sentence altogether?

2
In English, conjunctions and adverbs share the property of not being inflected. This means that a lemmatisation function should always return its input unchanged for members of this POS class. So I suggest you use pos='r' when calling WordNetLemmatizer.lemmatize. - lenz

2 Answers

0
votes

I think you can fall back to the lemmatizer's default POS, which is noun, for any part of speech other than verb, adverb, adjective, or noun.

https://bommaritollc.com/2014/06/30/advanced-approximate-sentence-matching-python/

The above website's Approach #6 does the same thing.
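
A minimal sketch of that fallback, assuming the simplified tags look like the ones in the question. The names `SIMPLIFIED_TO_WORDNET`, `to_wordnet_pos` and `lemmatize_tagged` are made up for illustration; the single letters are the values NLTK's `wordnet.NOUN`, `wordnet.VERB`, `wordnet.ADJ` and `wordnet.ADV` constants hold.

```python
# Simplified tags mapped to WordNet POS letters (noun, verb, adjective, adverb).
SIMPLIFIED_TO_WORDNET = {"NOUN": "n", "VERB": "v", "ADJ": "a", "ADV": "r"}


def to_wordnet_pos(simplified_tag, default="n"):
    """Return a WordNet POS letter, falling back to noun for any other tag."""
    return SIMPLIFIED_TO_WORDNET.get(simplified_tag, default)


def lemmatize_tagged(tagged_words):
    """Lemmatize (word, simplified_tag) pairs. Needs NLTK and its WordNet data."""
    # Imported lazily so the tag mapping above can be used without NLTK installed.
    from nltk.stem import WordNetLemmatizer

    lemmatizer = WordNetLemmatizer()
    return [
        lemmatizer.lemmatize(word, pos=to_wordnet_pos(tag))
        for word, tag in tagged_words
    ]
```

With this, a pair such as `('And', 'CONJ')` is looked up as a noun, which is what `WordNetLemmatizer.lemmatize` does anyway when no `pos` is given.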

0
votes

Conjunctions are already in their lemma form, so you could skip lemmatizing them.
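
Something along these lines might work as a rough sketch: only words whose tag has a WordNet class are passed to the lemmatizer, and everything else is kept as-is. `WORDNET_POS` and `lemmatize_or_keep` are hypothetical names, and the lemmatizing function is passed in as an argument so the helper itself does not depend on NLTK.

```python
# Tags that have a WordNet class; any other tag (CONJ, ADP, ...) is left alone.
WORDNET_POS = {"NOUN": "n", "VERB": "v", "ADJ": "a", "ADV": "r"}


def lemmatize_or_keep(tagged_words, lemmatize):
    """Call lemmatize(word, pos) for covered tags; keep other words unchanged."""
    result = []
    for word, tag in tagged_words:
        pos = WORDNET_POS.get(tag)
        result.append(lemmatize(word, pos) if pos else word)
    return result
```

Assuming NLTK and its WordNet data are installed, it could be called as `lemmatize_or_keep(simplifiedTags, WordNetLemmatizer().lemmatize)`.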