How can I lemmatize words like "makes" so that they become "make" using quanteda?
In Python this can be done with the NLTK WordNet Lemmatizer.
Stemming can be done with tokens_wordstem or dfm_wordstem, but lemmatizing needs to be done with tokens_replace. Note the difference between the two: when lemmatizing, "am" is changed into "be", as this is its lemma.
In the lexicon package there is a table called hash_lemmas that you can use as a dictionary. There is no default lemma function in quanteda.
txt <- c("I am going to lemmatize makes into make, but not maker")
library(quanteda)
# stemming
tokens_wordstem(tokens(txt))
Tokens consisting of 1 document.
text1 :
[1] "I" "am" "go" "to" "lemmat" "make" "into" "make" "," "but" "not" "maker"
# lemmatizing using lemma table
tokens_replace(tokens(txt), pattern = lexicon::hash_lemmas$token, replacement = lexicon::hash_lemmas$lemma)
Tokens consisting of 1 document.
text1 :
[1] "I" "be" "go" "to" "lemmatize" "make" "into" "make" "," "but" "not"
[12] "maker"
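The same tokens_replace call should work with any token-to-lemma mapping, not only lexicon::hash_lemmas. As a minimal sketch, assuming a small hand-made lookup table (my_lemmas is a made-up name for illustration, not part of any package):

```r
library(quanteda)

# Hypothetical, hand-made lemma table for illustration only
my_lemmas <- data.frame(
  token = c("am", "going", "makes"),
  lemma = c("be", "go", "make"),
  stringsAsFactors = FALSE
)

toks <- tokens("I am going to lemmatize makes into make, but not maker")

# valuetype = "fixed" matches the patterns literally instead of as globs
toks_lemma <- tokens_replace(
  toks,
  pattern = my_lemmas$token,
  replacement = my_lemmas$lemma,
  valuetype = "fixed"
)

# Flatten to a plain character vector to inspect the result
as.character(toks_lemma)
```

Setting valuetype = "fixed" is a design choice rather than a requirement: the default is "glob", and literal matching avoids surprises if a token in the table happens to contain * or ?.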
Another option is to use spacyr in combination with quanteda; see the tutorial with spacyr.
Or you can first use udpipe to get the lemmas and then use quanteda's tokens_replace or dfm_replace functions.
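If you already have a dfm, the replacement can be done at the feature level with dfm_replace. The sketch below assumes a small hand-made lookup (lemma_tab is a made-up name); in practice the token and lemma columns would come from whatever lemma source you use, such as the output of udpipe.

```r
library(quanteda)

txt <- c("I am going to lemmatize makes into make, but not maker")
dfmat <- dfm(tokens(txt))

# Hypothetical lemma lookup for illustration; replace it with the
# token/lemma pairs from your own lemma source
lemma_tab <- data.frame(
  token = c("am", "going", "makes"),
  lemma = c("be", "go", "make"),
  stringsAsFactors = FALSE
)

# Rename features according to the lookup
dfmat_lemma <- dfm_replace(
  dfmat,
  pattern = lemma_tab$token,
  replacement = lemma_tab$lemma
)

featnames(dfmat_lemma)
```

Working on the dfm is convenient when the tokens object is no longer available, but note that dfm() lowercases features by default, so the lookup should use lowercase tokens.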