I am trying to use a customized word list to remove phrases from text.
Here is a reproducible example.
I think something is not right with my attempt:
mystop <- structure(list(stopwords = c("remove", "this line", "remove this line",
"two lines")), .Names = "stopwords", class = "data.frame", row.names = c(NA,
-4L))
df <- structure(list(stopwords = c("Something to remove", "this line must remove two tokens",
"remove this line must remove three tokens", "two lines to",
"nothing here to stop")), .Names = "stopwords", class = "data.frame", row.names = c(NA,
-5L))
> mycorpus <- corpus(df$stopwords)
> mydfm <- dfm(tokens_remove(tokens(df$stopwords, remove_punct = TRUE), c(stopwords("SMART"), mystop$stopwords)), ngrams = c(1,3))
>
>
> #convert the dfm to dataframe
> df_ngram <- data.frame(Content = featnames(mydfm), Frequency = colSums(mydfm),
+ row.names = NULL, stringsAsFactors = FALSE)
>
> df_ngram
Content Frequency
1 line 2
2 tokens 2
3 lines 1
4 stop 1
> df
stopwords
1 Something to remove
2 this line must remove two tokens
3 remove this line must remove three tokens
4 two lines to
5 nothing here to stop
For example, in the dfm I expected to find something like "Something to", i.e. I expected to see every document clean, without "remove".
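A note on why the multi-word entries never match: by default tokens_remove() compares patterns against single tokens, so a pattern such as "this line" cannot match anything. Wrapping multi-word patterns in phrase() makes quanteda match them as whole token sequences. A minimal sketch (assuming a recent quanteda version):

```r
library(quanteda)

# by default tokens_remove() matches single tokens only; phrase()
# marks multi-word patterns so they match as whole sequences
toks <- tokens("this line must remove two tokens", remove_punct = TRUE)
toks_clean <- tokens_remove(toks, pattern = phrase(c("this line", "two lines")))
as.character(toks_clean)
```

Single-token and phrase() patterns can be combined in one pattern vector, so a mixed stoplist like mystop$stopwords can be passed through phrase() as a whole.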
I would also like to remove the stopword features from the n-gram tokens, so I tried this:
mydfm2 <- dfm(tokens_remove(tokens(df$stopwords, remove_punct = TRUE, ngrams = 1:3), remove = c(stopwords("english"), mystop$stopwords)))
Error in tokens_select(x, ..., selection = "remove") :
unused argument (remove = c("i", "me", "my", "myself", "we", "our", "ours", "ourselves", "you", "your", "yours", "yourself", "yourselves", "he", "him", "his", "himself", "she", "her", "hers", "herself", "it", "its", "itself", "they", "them", "their", "theirs", "themselves", "what", "which", "who", "whom", "this", "that", "these", "those", "am", "is", "are", "was", "were", "be", "been", "being", "have", "has", "had", "having", "do", "does", "did", "doing", "would", "should", "could", "ought", "i'm", "you're",
"he's", "she's", "it's", "we're", "they're", "i've", "you've", "we've", "they've", "i'd", "you'd", "he'd", "she'd", "we'd", "they'd", "i'll", "you'll", "he'll", "she'll", "we'll", "they'll", "isn't", "aren't", "wasn't", "weren't", "hasn't", "haven't", "hadn't", "doesn't", "don't", "didn't", "won't", "wouldn't", "shan't", "shouldn't", "can't", "cannot", "couldn't", "mustn't", "let's", "that's", "who's", "what's", "here's", "there's", "when's", "where's", "why's", "how's",
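The error occurs because tokens_remove() has no argument named remove: its second argument is pattern, usually given positionally. Also, recent quanteda versions build n-grams with tokens_ngrams() rather than an ngrams argument to tokens(). A corrected sketch of the same call, with small stand-in data frames in place of the ones above:

```r
library(quanteda)

df <- data.frame(stopwords = c("Something to remove", "two lines to"),
                 stringsAsFactors = FALSE)
mystop <- data.frame(stopwords = c("remove", "lines"), stringsAsFactors = FALSE)

# tokens_remove()'s second argument is `pattern`; there is no `remove` argument
toks <- tokens_remove(tokens(df$stopwords, remove_punct = TRUE),
                      pattern = c(stopwords("english"), mystop$stopwords))

# recent quanteda builds n-grams with tokens_ngrams(), not tokens(..., ngrams = ...)
mydfm2 <- dfm(tokens_ngrams(toks, n = 1:3))
```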
Edit, with another reproducible example. This is a dummy text I found in another question:
df <- structure(list(text = c("video game consoles stereos smartphone chargers and other similar devices constantly draw power into their power supplies. Unplug all of your chargers whether it's for a tablet or a toothbrush. Electronics with standby or \\\"\\\"sleep\\\"\\\" modes: Desktop PCs televisions cable boxes DVD-ray players alarm clocks radios and anything with a remote",
"...its judgment and order dated 02.05.2016 in Modern Dental College Research Centre (supra) authorizing it to oversee all statutory functions under the Act and leaving it at liberty to issue appropriate remedial directions the impugned order is in the teeth of the recommendations of the said Committee as communicated in its letter dated 14.05.2017",
"... focus to the ayurveda sector especially in oral care. A year ago Colgate launched its first India-focused ayurvedic brand Cibaca Vedshakti aimed squarely at countering Dant Kanti. HUL too launched araft of ayurvedic personal care products including toothpaste under the Ayush brand. RIVAL TO WATCH OUT FOR Colgate Palmolive global CEO Ian",
"...founder of Increate Value Advisors. Patanjali has brought the focus back on product efficacy. Rising above the noise of advertising products have to first deliver value to the consumers. Ghee and tooth paste are the two most popular products of Patanjali even though both of these have enough local and multinational competitors in the organised",
"The Bombay High Court today came down heavily on the Maharashtra government for not providing space and or hiring enough employees for the State Human Rights Commission. The commission has been left a toothless tiger as due to a lack of space and employees it has not been able to hear cases of human rights violations in Maharashtra. A division"
)), .Names = "text", class = "data.frame", row.names = c(NA,
-5L))
The stopwords (I created this list using quanteda's n-grams):
mystop <- structure(list(stop = c("dated_modern_dental", "hiring", "local",
"employees", "modern_dental_college", "multinational", "competitors",
"state", "dental_college_research", "organised", "human", "rights",
"college_research_centre", "commission", "founder_increate_advisors",
"research_centre_supra", "sector_oral_care", "left", "toothless",
"centre_supra_authorizing")), .Names = "stop", class = "data.frame", row.names = c(NA,
-20L))
All the steps of the code:
library(quanteda)
library(stringr)
#text to lower case
df$text <- tolower(df$text)
#remove all special characters
df$text <- gsub("[[:punct:]]", " ", df$text)
#remove numbers
df$text <- gsub('[0-9]+', '', df$text)
#remove remaining non-alphanumeric characters (e.g. Chinese characters)
df$text <- str_replace_all(df$text, "[^[:alnum:]]", " ")
#remove long spaces
df$text <- gsub("\\s+"," ",str_trim(df$text))
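The four cleanup passes above overlap quite a bit (the punctuation and digit patterns are subsets of [^[:alnum:]]). They can be collapsed into a single base-R helper; this sketch keeps only letters, turning everything else into a space:

```r
# one pass over the text: anything that is not a lowercase letter
# becomes a space, then surrounding whitespace is trimmed
clean_text <- function(x) {
  x <- tolower(x)
  x <- gsub("[^a-z]+", " ", x)
  trimws(x)
}

clean_text("Unplug ALL 42 chargers -- whether it's a tablet!")
```

Since a run of rejected characters collapses to one space, a separate "remove long spaces" pass is no longer needed.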
This is the step where I make the n-grams and also remove the English stopwords, combined with my stopword list, from the input text.
myDfm <- dfm(tokens_remove(tokens(df$text, remove_punct = TRUE), c(stopwords("SMART"), mystop$stop)), ngrams = c(1,3))
However, if I convert myDfm to a data frame to check whether the stopword removal worked, I can see them again:
df_ngram <- data.frame(Content = featnames(myDfm), Frequency = colSums(myDfm),
row.names = NULL, stringsAsFactors = FALSE)
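One likely reason the entries survive: n-gram patterns in mystop$stop such as "modern_dental_college" only exist as features after the n-grams are joined, while tokens_remove() here runs on unigram tokens before dfm(..., ngrams = c(1,3)) builds them. A sketch of removing them from the finished dfm instead, using dfm_remove() on a tiny example:

```r
library(quanteda)

toks <- tokens("modern dental college research centre", remove_punct = TRUE)
myDfm <- dfm(tokens_ngrams(toks, n = 1:3))

# underscore-joined features like "modern_dental_college" only exist
# in the dfm, so remove them there rather than at the token stage
myDfm_clean <- dfm_remove(myDfm, pattern = c("modern_dental_college",
                                             "dental_college_research"))
```

The single-token entries of the stoplist can still be removed earlier with tokens_remove(); only the underscore-joined n-gram entries need to wait until after the n-grams exist.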