The `WordNetLemmatizer` uses the `._morphy` function to look up a word's lemmas (from http://www.nltk.org/_modules/nltk/stem/wordnet.html) and returns the shortest of the possible lemmas:
```python
def lemmatize(self, word, pos=NOUN):
    lemmas = wordnet._morphy(word, pos)
    return min(lemmas, key=len) if lemmas else word
```
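To see just the selection step in isolation, here is a toy sketch (not the real NLTK object; `pick_lemma` is a made-up name standing in for the last line of `lemmatize`, with the lemma list supplied by hand instead of by `_morphy`):

```python
def pick_lemma(lemmas, word):
    # Among the candidate lemmas, return the shortest one;
    # if there are no candidates, fall back to the word itself.
    return min(lemmas, key=len) if lemmas else word

print(pick_lemma(['running', 'run'], 'running'))  # -> run
print(pick_lemma([], 'frobnicate'))               # -> frobnicate
```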
The `._morphy` function applies rules iteratively to obtain a lemma; the rules keep shortening the word by substituting affixes according to `MORPHOLOGICAL_SUBSTITUTIONS`, and then it checks whether the reduced form exists as a word in the WordNet database:
```python
def _morphy(self, form, pos):
    # from jordanbg:
    # Given an original string x
    # 1. Apply rules once to the input to get y1, y2, y3, etc.
    # 2. Return all that are in the database
    # 3. If there are no matches, keep applying rules until you either
    #    find a match or you can't go any further

    exceptions = self._exception_map[pos]
    substitutions = self.MORPHOLOGICAL_SUBSTITUTIONS[pos]

    def apply_rules(forms):
        return [form[:-len(old)] + new
                for form in forms
                for old, new in substitutions
                if form.endswith(old)]

    def filter_forms(forms):
        result = []
        seen = set()
        for form in forms:
            if form in self._lemma_pos_offset_map:
                if pos in self._lemma_pos_offset_map[form]:
                    if form not in seen:
                        result.append(form)
                        seen.add(form)
        return result

    # 0. Check the exception lists
    if form in exceptions:
        return filter_forms([form] + exceptions[form])

    # 1. Apply rules once to the input to get y1, y2, y3, etc.
    forms = apply_rules([form])

    # 2. Return all that are in the database (and check the original too)
    results = filter_forms([form] + forms)
    if results:
        return results

    # 3. If there are no matches, keep applying rules until we find a match
    while forms:
        forms = apply_rules(forms)
        results = filter_forms(forms)
        if results:
            return results

    # Return an empty list if we can't find anything
    return []
```
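The rule loop can be reproduced standalone. Below is a sketch with a hand-picked subset of the noun substitutions and a tiny `lexicon` set standing in for the `_lemma_pos_offset_map` database check (both are simplifications, not the real NLTK data structures):

```python
# A few suffix rules in the (old, new) shape of MORPHOLOGICAL_SUBSTITUTIONS
substitutions = [('s', ''), ('ies', 'y'), ('ches', 'ch'), ('men', 'man')]
# Toy stand-in for the WordNet database lookup
lexicon = {'pony', 'church', 'man'}

def apply_rules(forms):
    # Every matching suffix rule produces one candidate form
    return [form[:-len(old)] + new
            for form in forms
            for old, new in substitutions
            if form.endswith(old)]

def filter_forms(forms):
    # Keep only candidates that exist in the lexicon, de-duplicated
    seen, result = set(), []
    for form in forms:
        if form in lexicon and form not in seen:
            result.append(form)
            seen.add(form)
    return result

def morphy_sketch(form):
    forms = apply_rules([form])
    results = filter_forms([form] + forms)
    while forms and not results:  # step 3: keep applying rules
        forms = apply_rules(forms)
        results = filter_forms(forms)
    return results

print(morphy_sketch('ponies'))    # -> ['pony']
print(morphy_sketch('churches'))  # -> ['church']
```

Note that `apply_rules` can produce garbage candidates (`'ponies'` also yields `'ponie'` via the `('s', '')` rule); `filter_forms` is what throws these out, because only forms actually present in the database survive.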
However, if the word is in the list of exceptions, it returns a fixed value kept in the exception map; see `_load_exception_map` in http://www.nltk.org/_modules/nltk/corpus/reader/wordnet.html:
```python
def _load_exception_map(self):
    # load the exception file data into memory
    for pos, suffix in self._FILEMAP.items():
        self._exception_map[pos] = {}
        for line in self.open('%s.exc' % suffix):
            terms = line.split()
            self._exception_map[pos][terms[0]] = terms[1:]
    self._exception_map[ADJ_SAT] = self._exception_map[ADJ]
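The file format is simply `<inflected form> <lemma> [<lemma> ...]`, one entry per line, so the parsing step can be sketched with two sample lines fed in directly instead of read from a file:

```python
# Sketch of the .exc parsing loop; the sample lines mimic the
# whitespace-separated format of the real adv.exc / adj.exc files.
lines = ["worse bad", "farther far"]

exception_map = {}
for line in lines:
    terms = line.split()
    # First token is the inflected form; the rest are its lemmas
    exception_map[terms[0]] = terms[1:]

print(exception_map)  # -> {'worse': ['bad'], 'farther': ['far']}
```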
Going back to your example, `worse` -> `bad` and `further` -> `far` CANNOT be produced by the rules, so they must come from the exception lists. And since these are hand-maintained exception lists, there are bound to be inconsistencies.

The exception lists are kept in `~/nltk_data/corpora/wordnet/adv.exc` and `~/nltk_data/corpora/wordnet/adj.exc`.
From `adv.exc`:

```
best well
better well
deeper deeply
farther far
further far
harder hard
hardest hard
```
From `adj.exc`:

```
...
worldliest worldly
wormier wormy
wormiest wormy
worse bad
worst bad
worthier worthy
worthiest worthy
wrier wry
...
```
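Putting the pieces together: step 0 of `_morphy` consults the exception map before any rules run, so a toy lookup built by hand from the entries quoted above (a simplification; the real map is loaded from the files and filtered against the database) reproduces the behavior you observed:

```python
# Hand-built from the adv.exc / adj.exc excerpts above
adv_exc = {"further": ["far"], "farther": ["far"], "best": ["well"]}
adj_exc = {"worse": ["bad"], "worst": ["bad"]}

def exc_lemmas(word, exc_map):
    # Step 0 of _morphy: an exception entry wins before any rule applies
    return exc_map.get(word, [])

print(exc_lemmas("worse", adj_exc))    # -> ['bad']
print(exc_lemmas("further", adv_exc))  # -> ['far']
```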