I am trying to use a customized word list to remove phrases from text.
Here is a reproducible example.
I think something is not right with my attempt:
mystop <- structure(list(stopwords = c("remove", "this line", "remove this line",
"two lines")), .Names = "stopwords", class = "data.frame", row.names = c(NA,
-4L))
df <- structure(list(stopwords = c("Something to remove", "this line must remove two tokens",
"remove this line must remove three tokens", "two lines to",
"nothing here to stop")), .Names = "stopwords", class = "data.frame", row.names = c(NA,
-5L))
> mycorpus <- corpus(df$stopwords)
> mydfm <- dfm(tokens_remove(tokens(df$stopwords, remove_punct = TRUE), c(stopwords("SMART"), mystop$stopwords)), ngrams = c(1,3))
>
>
> #convert the dfm to dataframe
> df_ngram <- data.frame(Content = featnames(mydfm), Frequency = colSums(mydfm),
+ row.names = NULL, stringsAsFactors = FALSE)
>
> df_ngram
Content Frequency
1 line 2
2 tokens 2
3 lines 1
4 stop 1
> df
stopwords
1 Something to remove
2 this line must remove two tokens
3 remove this line must remove three tokens
4 two lines to
5 nothing here to stop
For example, in the dfm I expected to find something like "Something to", i.e. I expected to see every document clean, without "remove".
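A note on why the multi-word entries never match: by default tokens_remove() compares patterns against single tokens, so a pattern such as "this line" cannot match anything. Wrapping multi-word patterns in phrase() makes quanteda match them as whole token sequences. A minimal sketch (assuming a recent quanteda version):

```r
library(quanteda)

# by default tokens_remove() matches single tokens only; phrase()
# marks multi-word patterns so they match as whole sequences
toks <- tokens("this line must remove two tokens", remove_punct = TRUE)
toks_clean <- tokens_remove(toks, pattern = phrase(c("this line", "two lines")))
as.character(toks_clean)
```

Single-token and phrase() patterns can be combined in one pattern vector, so a mixed stoplist like mystop$stopwords can be passed through phrase() as a whole.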
I would also like to remove the stopword features from the n-gram tokens, so I tried this:
mydfm2 <- dfm(tokens_remove(tokens(df$stopwords, remove_punct = TRUE, ngrams = 1:3), remove = c(stopwords("english"), mystop$stopwords)))
Error in tokens_select(x, ..., selection = "remove") :
unused argument (remove = c("i", "me", "my", "myself", "we", "our", "ours", "ourselves", "you", "your", "yours", "yourself", "yourselves", "he", "him", "his", "himself", "she", "her", "hers", "herself", "it", "its", "itself", "they", "them", "their", "theirs", "themselves", "what", "which", "who", "whom", "this", "that", "these", "those", "am", "is", "are", "was", "were", "be", "been", "being", "have", "has", "had", "having", "do", "does", "did", "doing", "would", "should", "could", "ought", "i'm", "you're",
"he's", "she's", "it's", "we're", "they're", "i've", "you've", "we've", "they've", "i'd", "you'd", "he'd", "she'd", "we'd", "they'd", "i'll", "you'll", "he'll", "she'll", "we'll", "they'll", "isn't", "aren't", "wasn't", "weren't", "hasn't", "haven't", "hadn't", "doesn't", "don't", "didn't", "won't", "wouldn't", "shan't", "shouldn't", "can't", "cannot", "couldn't", "mustn't", "let's", "that's", "who's", "what's", "here's", "there's", "when's", "where's", "why's", "how's",
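The error occurs because tokens_remove() has no argument named remove: its second argument is pattern, usually given positionally. Also, recent quanteda versions build n-grams with tokens_ngrams() rather than an ngrams argument to tokens(). A corrected sketch of the same call, with small stand-in data frames in place of the ones above:

```r
library(quanteda)

df <- data.frame(stopwords = c("Something to remove", "two lines to"),
                 stringsAsFactors = FALSE)
mystop <- data.frame(stopwords = c("remove", "lines"), stringsAsFactors = FALSE)

# tokens_remove()'s second argument is `pattern`; there is no `remove` argument
toks <- tokens_remove(tokens(df$stopwords, remove_punct = TRUE),
                      pattern = c(stopwords("english"), mystop$stopwords))

# recent quanteda builds n-grams with tokens_ngrams(), not tokens(..., ngrams = ...)
mydfm2 <- dfm(tokens_ngrams(toks, n = 1:3))
```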
Edit, with another reproducible example. This is a dummy text I found in another question:
df <- structure(list(text = c("video game consoles stereos smartphone chargers and other similar devices constantly draw power into their power supplies. Unplug all of your chargers whether it's for a tablet or a toothbrush. Electronics with standby or \\\"\\\"sleep\\\"\\\" modes: Desktop PCs televisions cable boxes DVD-ray players alarm clocks radios and anything with a remote",
"...its judgment and order dated 02.05.2016 in Modern Dental College Research Centre (supra) authorizing it to oversee all statutory functions under the Act and leaving it at liberty to issue appropriate remedial directions the impugned order is in the teeth of the recommendations of the said Committee as communicated in its letter dated 14.05.2017",
"... focus to the ayurveda sector especially in oral care. A year ago Colgate launched its first India-focused ayurvedic brand Cibaca Vedshakti aimed squarely at countering Dant Kanti. HUL too launched araft of ayurvedic personal care products including toothpaste under the Ayush brand. RIVAL TO WATCH OUT FOR Colgate Palmolive global CEO Ian",
"...founder of Increate Value Advisors. Patanjali has brought the focus back on product efficacy. Rising above the noise of advertising products have to first deliver value to the consumers. Ghee and tooth paste are the two most popular products of Patanjali even though both of these have enough local and multinational competitors in the organised",
"The Bombay High Court today came down heavily on the Maharashtra government for not providing space and or hiring enough employees for the State Human Rights Commission. The commission has been left a toothless tiger as due to a lack of space and employees it has not been able to hear cases of human rights violations in Maharashtra. A division"
)), .Names = "text", class = "data.frame", row.names = c(NA,
-5L))
The stopwords (I created this list using quanteda's n-grams):
mystop <- structure(list(stop = c("dated_modern_dental", "hiring", "local",
"employees", "modern_dental_college", "multinational", "competitors",
"state", "dental_college_research", "organised", "human", "rights",
"college_research_centre", "commission", "founder_increate_advisors",
"research_centre_supra", "sector_oral_care", "left", "toothless",
"centre_supra_authorizing")), .Names = "stop", class = "data.frame", row.names = c(NA,
-20L))
All the steps of the code:
library(quanteda)
library(stringr)
#text to lower case
df$text <- tolower(df$text)
#remove all special characters
df$text <- gsub("[[:punct:]]", " ", df$text)
#remove numbers
df$text <- gsub('[0-9]+', '', df$text)
#remove remaining non-alphanumeric characters (e.g. Chinese characters)
df$text <- str_replace_all(df$text, "[^[:alnum:]]", " ")
#remove long spaces
df$text <- gsub("\\s+"," ",str_trim(df$text))
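The four cleanup passes above overlap quite a bit (the punctuation and digit patterns are subsets of [^[:alnum:]]). They can be collapsed into a single base-R helper; this sketch keeps only letters, turning everything else into a space:

```r
# one pass over the text: anything that is not a lowercase letter
# becomes a space, then surrounding whitespace is trimmed
clean_text <- function(x) {
  x <- tolower(x)
  x <- gsub("[^a-z]+", " ", x)
  trimws(x)
}

clean_text("Unplug ALL 42 chargers -- whether it's a tablet!")
```

Since a run of rejected characters collapses to one space, a separate "remove long spaces" pass is no longer needed.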
This is the step where I make the n-grams and also remove the English stopwords, combined with my stopword list, from the input text.
myDfm <- dfm(tokens_remove(tokens(df$text, remove_punct = TRUE), c(stopwords("SMART"), mystop$stop)), ngrams = c(1,3))
However, if I convert myDfm to a data frame to check whether the stopword removal worked, I can see them again:
df_ngram <- data.frame(Content = featnames(myDfm), Frequency = colSums(myDfm),
row.names = NULL, stringsAsFactors = FALSE)
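One likely reason the entries survive: n-gram patterns in mystop$stop such as "modern_dental_college" only exist as features after the n-grams are joined, while tokens_remove() here runs on unigram tokens before dfm(..., ngrams = c(1,3)) builds them. A sketch of removing them from the finished dfm instead, using dfm_remove() on a tiny example:

```r
library(quanteda)

toks <- tokens("modern dental college research centre", remove_punct = TRUE)
myDfm <- dfm(tokens_ngrams(toks, n = 1:3))

# underscore-joined features like "modern_dental_college" only exist
# in the dfm, so remove them there rather than at the token stage
myDfm_clean <- dfm_remove(myDfm, pattern = c("modern_dental_college",
                                             "dental_college_research"))
```

The single-token entries of the stoplist can still be removed earlier with tokens_remove(); only the underscore-joined n-gram entries need to wait until after the n-grams exist.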