Stemmers vs Lemmatizers

Question

Natural Language Processing (NLP), especially for English, has reached a stage where stemming would be an archaic technology if "perfect" lemmatizers existed. That is because stemmers reduce the surface form of a word/token to a stem that is often not a word at all.

Then again, the definition of a "perfect" lemmatizer is questionable, because different NLP tasks require different levels of lemmatization, e.g. converting words between verb/noun/adjective forms.

Stemmers

[in]: having
[out]: hav

Lemmatizers

[in]: having
[out]: have

So the question is, are English stemmers any useful at all today? Since we have a plethora of lemmatization tools for English
If not, then how should we move on to build robust lemmatizers that can take on nounify, verbify, adjectify and adverbify preprocesses?
How could the lemmatization task be easily scaled to other languages that have similar morphological structures as English?

Is there a particular task you have in mind? It's always easier to answer an NLP question in context. — Renaud
To create a lemmatizer flexible enough for any NLP task, where different tasks require different levels of abstraction =) — alvas
See also: what is the true difference between lemmatization vs stemming? — hippietrail

Jirka · Accepted Answer · 2013-06-26T12:46:55

Q1: "[..] are English stemmers any useful at all today? Since we have a plethora of lemmatization tools for English"

Yes. Stemmers are much simpler, smaller and usually faster than lemmatizers, and for many applications their results are good enough. Using a lemmatizer for that is a waste of resources. Consider, for example, dimensionality reduction in Information Retrieval. You replace all drive/driving by driv in both the searched documents and the query. You do not care if it is drive or driv or x17a$ as long as it clusters inflectionally related words together.
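
A minimal sketch of that Information Retrieval use, assuming a deliberately crude suffix stripper (the suffix list, `toy_stem`, `build_index` and `search` are all made up for illustration; a real system would use something like the Porter stemmer). The same stemmer is applied to documents and query, so it does not matter that `driv` is not a word:

```python
from collections import defaultdict

# Hypothetical, deliberately tiny suffix list; real stemmers have many more rules.
SUFFIXES = ("ing", "es", "ed", "e", "s")


def toy_stem(word: str) -> str:
    """Strip the first matching suffix, keeping at least three characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def build_index(docs: dict[str, str]) -> dict[str, set[str]]:
    """Map each stem to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.split():
            index[toy_stem(token)].add(doc_id)
    return index


def search(index: dict[str, set[str]], query: str) -> set[str]:
    """Return the documents that contain every stemmed query term."""
    hits = [index.get(toy_stem(term), set()) for term in query.split()]
    return set.intersection(*hits) if hits else set()


docs = {
    "d1": "she drives a truck",
    "d2": "driving lessons",
    "d3": "a river bank",
}
index = build_index(docs)
print(sorted(search(index, "drive")))  # ['d1', 'd2']
```

Note that this sketch leaves irregular forms such as drove in their own cluster, which is the kind of loss many retrieval applications simply accept.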

Q2: "[..]how should we move on to build robust lemmatizers that can take on nounify, verbify, adjectify and adverbify preprocesses?"

What is your definition of a lemma, does it include derivation (drive - driver) or only inflection (drive - drives - drove)? Does it take into account semantics?

If you want to include derivation (which most people would say includes verbing nouns etc.) then keep in mind that derivation is far more irregular than inflection. There are many idiosyncrasies, gaps, etc. Do you really want change (as in change trains) and change (as in coins) to have the same lemma? If not, where do you draw the boundary? How about nerve - unnerve, earth - unearth - earthling, ...? It really depends on the application.

If you take into account semantics (bank would be labeled as bank-money or bank-river depending on context), how deep do you go (do you distinguish bank-institution from bank-building)? Some apps may not care about this at all, some might want to distinguish basic semantics, some might want it fine-grained.

Q3: "How could the lemmatization task be easily scaled to other languages that have similar morphological structures as English?"

What do you mean by "similar morphological structures as English"? English has very little inflectional morphology. There are good lemmatizers for languages of other morphological types (truly inflectional, agglutinative, templatic, ...).

With the possible exception of agglutinative languages, I would argue that a lookup table (say a compressed trie) is the best solution, possibly with some backup rules for unknown words such as proper names. The lookup is followed by some kind of disambiguation, ranging from trivial (take the first one, or take the first one consistent with the word's POS tag) to much more sophisticated. The more sophisticated disambiguations are usually supervised stochastic algorithms (e.g. TreeTagger or Faster), although a combination of machine learning and manually created rules has been done too (see e.g. this).
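
A rough sketch of the lookup-plus-trivial-disambiguation idea, assuming a plain (uncompressed) trie and made-up tag names; `LemmaTrie` and `lemmatize` are hypothetical names, not an existing library. The backup rule here simply returns unknown words unchanged:

```python
class TrieNode:
    __slots__ = ("children", "analyses")

    def __init__(self):
        self.children = {}   # next character -> TrieNode
        self.analyses = []   # (lemma, pos) pairs for the form ending here


class LemmaTrie:
    """Plain trie from surface forms to their possible (lemma, pos) analyses."""

    def __init__(self):
        self.root = TrieNode()

    def add(self, form, lemma, pos):
        node = self.root
        for ch in form:
            node = node.children.setdefault(ch, TrieNode())
        node.analyses.append((lemma, pos))

    def lookup(self, form):
        node = self.root
        for ch in form:
            node = node.children.get(ch)
            if node is None:
                return []
        return node.analyses


def lemmatize(trie, form, pos=None):
    analyses = trie.lookup(form.lower())
    if not analyses:
        # Backup rule for unknown words (e.g. proper names): keep the form.
        return form
    if pos is not None:
        # Take the first analysis consistent with the given POS tag.
        for lemma, tag in analyses:
            if tag == pos:
                return lemma
    # Trivial disambiguation: take the first analysis.
    return analyses[0][0]


trie = LemmaTrie()
trie.add("having", "have", "VERB")
trie.add("saw", "see", "VERB")
trie.add("saw", "saw", "NOUN")

print(lemmatize(trie, "having"))       # have
print(lemmatize(trie, "saw"))          # see
print(lemmatize(trie, "saw", "NOUN"))  # saw
print(lemmatize(trie, "Jirka"))        # Jirka
```

A stochastic tagger would replace the "first consistent analysis" step by choosing the analysis that best fits the sentence context.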

Obviously, for most languages you do not want to create the lookup table by hand, but instead generate it from a description of the morphology of that language. For inflectional languages, you can go the engineering way of Hajic for Czech or Mikheev for Russian, or, if you are daring, you can use two-level morphology. Or you can do something in between, such as Hana (myself). (Note that these are all full morphological analyzers that include lemmatization.) Or you can learn the lemmatizer in an unsupervised manner a la Yarowsky and Wicentowski, possibly with manual post-processing, correcting the most frequent words.
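
To make "generate it from a description of morphology" concrete, here is a toy sketch, assuming a stem-plus-paradigm description. The paradigm names, the Penn-Treebank-style tags and `generate_table` are illustrative assumptions and not how any of the systems mentioned above actually work:

```python
from collections import defaultdict

# Hypothetical paradigms: each maps to (ending, tag) pairs attached to a stem.
PARADIGMS = {
    "verb_plain": [("", "VB"), ("s", "VBZ"), ("ed", "VBD"), ("ed", "VBN"), ("ing", "VBG")],
    "verb_e": [("e", "VB"), ("es", "VBZ"), ("ing", "VBG")],
}

# Hypothetical lexicon entries: (lemma, stem, paradigm name).
LEXICON = [
    ("walk", "walk", "verb_plain"),
    ("drive", "driv", "verb_e"),
]

# Irregular forms are listed explicitly: (form, lemma, tag).
IRREGULAR = [
    ("drove", "drive", "VBD"),
    ("driven", "drive", "VBN"),
]


def generate_table(lexicon, paradigms, irregular):
    """Expand the description into a form -> [(lemma, tag), ...] lookup table."""
    table = defaultdict(list)
    for lemma, stem, paradigm in lexicon:
        for ending, tag in paradigms[paradigm]:
            table[stem + ending].append((lemma, tag))
    for form, lemma, tag in irregular:
        table[form].append((lemma, tag))
    return dict(table)


table = generate_table(LEXICON, PARADIGMS, IRREGULAR)
print(table["driving"])  # [('drive', 'VBG')]
print(table["walked"])   # [('walk', 'VBD'), ('walk', 'VBN')]
```

The generated table could then be loaded into whatever lookup structure is used at run time; ambiguous entries such as walked are exactly where the disambiguation step comes in.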

There are way too many options, and it really all depends on what you want to do with the results.
