R - Package tm - Which terms correspond to each common root after stemming?

Question

I have created a corpus, defined stopwords, and cleaned the text (removePunctuation, removeNumbers, tolower, ...).

The corpus is now ready to be stemmed. The stemming function runs correctly and everything works as it should, but...

I need to know which words are being stemmed to each common root. Is that possible using the tm package? Or any other package?

For example, TermA1, TermA2, TermB1, TermB2 and TermB3 are all stemmed to Term, and my new corpus contains only Term. However, I also need to know which original words are associated with each root, so the ideal output would be:

Term     Stem
TermA1   Term
TermA2   Term
TermB1   Term
TermB2   Term
TermB3   Term
...
WordA1   Word
WordB1   Word
WordB2   Word
WordB3   Word
WordC1   Word

tia_0 · Accepted Answer · 2016-07-21T21:30:55

The tm package provides the function stemCompletion, which completes each stem against a given dictionary (a corpus or a character vector of words).

To obtain your output, do the following:

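library(tm)
library(data.table)  # provides data.table(), used below

data("crude")  # sample corpus shipped with tm, used here as the dictionary
# stemCompletion returns a named character vector:
# the names are the stems, the values are the completed words
words <- stemCompletion(c("compan", "entit", "suppl"), crude)
stemmed <- names(words)
stemcomp <- unname(words)
data.table(stemmed, stemcomp)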

References: stemCompletion {tm}
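
The direction asked about in the question (every original term listed next to its stem) can also be sketched by stemming the vocabulary directly and keeping both columns. This is only a minimal sketch under assumptions: the terms vector below is a hypothetical vocabulary, and in practice it would be the unique terms of your cleaned corpus, collected before stemming (for example from Terms(TermDocumentMatrix(corpus))). The stemmer language "english" is also an assumption.

```r
library(SnowballC)

# Hypothetical vocabulary; in practice, use the unique terms of the cleaned
# corpus, collected BEFORE the corpus is stemmed.
terms <- c("company", "companies", "supply", "supplies", "supplied",
           "entity", "entities")

# Stem every term once and keep it next to its original form.
mapping <- data.frame(term = terms,
                      stem = wordStem(terms, language = "english"),
                      stringsAsFactors = FALSE)

# Group the original terms under the stem they share.
terms_by_stem <- split(mapping$term, mapping$stem)

print(mapping)
print(terms_by_stem)
```

mapping has one row per original term, as in the Term / Stem table from the question, and terms_by_stem is a named list with one character vector of original terms per stem.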

[UPDATE: more German words]

I tried this to check the behavior with German umlauts:

library(SnowballC)
library(tm)
library(data.table)

text <- c("für", "aktuelle", "Nachrichten", "und", "Themen", "Bilder",
          "und", "Videos", "aus", "den", "Bereichen", "News", "Wirtschaft",
          "Politik", "können", "Fremdschämen", "Lebensmüde", "Erklärungsnot")

# Stem each word, then complete the stems using the original words as dictionary
stemmed <- wordStem(text, language = "porter")
completed <- stemCompletion(stemmed, text)
comparison <- data.table(text, stemmed, completed)

In the comparison table you can see that the original words containing German umlauts are not stemmed. However, if you try to complete a given stem such as "f" with stemCompletion("f", text), you get the correct word "für". This is strange; maybe you can continue from here and look for a workaround.
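
One thing that may be worth checking: in SnowballC, language = "porter" selects the original English Porter stemmer, while language = "german" selects the German Snowball stemmer. The sketch below puts both side by side for a few of the words above. Whether this accounts for the behavior described here is an assumption, not something verified in this answer, and the column names are made up for illustration.

```r
library(SnowballC)

text <- c("für", "können", "Fremdschämen", "Lebensmüde", "Erklärungsnot",
          "Nachrichten", "Bereichen")

# Lower-case first, mirroring the cleaning step described in the question.
lowered <- tolower(text)

# Same words, two stemmers: English Porter vs. German Snowball.
comparison_lang <- data.frame(
  text        = text,
  stem_porter = wordStem(lowered, language = "porter"),
  stem_german = wordStem(lowered, language = "german"),
  stringsAsFactors = FALSE
)

print(comparison_lang)
```

If the two stem columns differ for the words with umlauts, the choice of stemmer language is a likely place to start looking.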
