Since quanteda has no built-in stopword list for Polish, I would like to use my own list. I have it in a text file as a list separated by spaces. If need be, I can also prepare a list separated by new lines.
How can I remove the custom long list of stopwords from my corpus? How can I do that after stemming?
I have tried reading the list in various formats and converting it to a character vector, for example:
stopwordsPL <- as.character(readtext("polish.stopwords.txt",encoding = "UTF-8"))
stopwordsPL <- read.txt("polish.stopwords.txt", encoding = "UTF-8", stringsAsFactors = F)
stopwordsPL <- dictionary(stopwordsPL)
I have also tried passing these word vectors to dfm() and related functions, for example:
myStemMat <-
  dfm(
    mycorpus,
    remove = as.vector(stopwordsPL),
    stem = FALSE,
    remove_punct = TRUE,
    ngrams = c(1, 3)
  )
dfm_trim(myStemMat, sparsity = stopwordsPL)
or
myStemMat <- dfm_remove(myStemMat,features = as.data.frame(stopwordsPL))
Nothing works. My stopwords still show up in the corpus and in the analysis. What is the proper way/syntax to apply custom stopwords?
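To make clear what result I am after, here is a minimal sketch in base R only, with no quanteda calls. The temporary file, the five sample words and the toy token vector are made up for illustration; my real file is polish.stopwords.txt. I am assuming the file is plain whitespace-separated words, so that scan() can read it straight into a character vector.

```r
# Hypothetical stand-in for my stopword file: words separated by spaces.
path <- tempfile(fileext = ".txt")
writeLines("i w na do nie", path)

# Read whitespace-separated words into a plain character vector.
# sep = "" makes scan() split on any whitespace, including new lines.
stopwordsPL <- scan(path, what = character(), sep = "",
                    fileEncoding = "UTF-8", quiet = TRUE)

# Toy tokens (made up) standing in for one tokenized document.
tokens <- c("kot", "i", "pies", "na", "dachu")

# The behaviour I expect: every token found in the stopword vector is dropped.
kept <- tokens[!tokens %in% stopwordsPL]
print(kept)
```

This is the effect I would like to get from quanteda on the whole corpus, ideally both before and after stemming.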