I tried using quanteda to extract the top features, but the results are modified words, e.g. 'faulti' instead of 'faulty'. Is this the expected result?
I also searched the original dataset for the top-feature keywords, but found no matches.
Edit: if I set stem = FALSE in dfm(), the keywords come back as normal words.
library(quanteda)
corpus1 <- corpus(as.character(training_data$Elec_rmk))
kwic(corpus1, 'faulty')
#[text25701, 4] Convertible roof sometime | faulty | . SD card missing.
#[text25701, 22] unavailable). Pilot lamp | faulty | .
dfm1 <- dfm(
  corpus1,
  ngrams = 1,
  remove = stopwords("english"),
  remove_punct = TRUE,
  remove_numbers = TRUE,
  stem = TRUE)
tf1 <- topfeatures(dfm1, n = 10)
tf1
# key words were modified/truncated words?
#  faulti malfunct    light    damag     miss    cover     rear     loos     lamp    plate
#     562      523      454      337      331      325      295      259      250      238
library(stringr)
sum(str_detect(training_data$Elec_rmk, 'faulti')) # 0
sum(str_detect(training_data$Elec_rmk, 'faulty')) # 495
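
The truncated forms look like stemmer output rather than corrupted text, which would fit with stem = FALSE restoring the normal words. A minimal sketch to check this, assuming quanteda's `char_wordstem()` is available in the installed version; the word list is made up for illustration and is not taken from `training_data`:

```r
library(quanteda)

# Hypothetical sample words (not from training_data), chosen to mirror
# some of the top features shown above
words <- c("faulty", "malfunction", "damaged", "missing", "loose")

# Stem each word with the English Snowball stemmer; this is assumed to be
# the same kind of stemming that stem = TRUE requests in dfm()
stems <- char_wordstem(words, language = "english")

# Show each original word next to its stem for comparison
print(data.frame(word = words, stem = stems))
```

If the stems line up with the top features, the dfm is behaving as requested, and searching the raw text for a stem such as 'faulti' would naturally return nothing.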