3
votes

I have a text corpus.

mytextdata <- read.csv("path/to/texts.csv", stringsAsFactors = FALSE)
Mystopwords <- read.csv("path/to/mystopwords.txt", stringsAsFactors = FALSE)

How can I filter this text? I must delete:

1) all numbers

2) the stop words

3) the brackets

I will not work with a DTM; I just need to clean this text data of numbers and stop words.

sample data:

112773-Tablet for cleaning the hydraulic system Jura (6 pcs.) 62715

Jura and the are stop words.

In the output I expect:

  Tablet for cleaning hydraulic system 
Would you provide sample data? – jazzurro
And the code you have tried so far, along with your expected result. This is a common text mining task; perhaps searching resources like library(tidytext) would get you going. – Nate
@jazzurro, I edited the post with sample data. – D.Joe
@Nate, I edited the post with what I expect. – D.Joe
Do you need to remove anything in ()? – jazzurro

2 Answers

5
votes

Since there is only one character string available in the question at the moment, I decided to create sample data myself. I hope this is close to your actual data. As Nate suggested, using the tidytext package is one way to go. Here, I first removed numbers, punctuation, contents in the brackets, and the brackets themselves. Then I split the words in each string using unnest_tokens() and removed stop words. Since you have your own stop words, you may want to create your own dictionary; I simply added jura in the filter() part. Finally, grouping the data by id, I combined the words in summarise() to re-create one character string per row. Note that I used jura instead of Jura: this is because unnest_tokens() converts capital letters to lower case.

mydata <- data.frame(id = 1:2,
                     text = c("112773-Tablet for cleaning the hydraulic system Jura (6 pcs.) 62715",
                              "1234567-Tablet for cleaning the mambojumbo system Jura (12 pcs.) 654321"),
                     stringsAsFactors = FALSE)

library(dplyr)
library(tidytext)

data(stop_words)

mutate(mydata, text = gsub(x = text, pattern = "[0-9]+|[[:punct:]]|\\(.*\\)", replacement = "")) %>%
unnest_tokens(input = text, output = word) %>%
filter(!word %in% c(stop_words$word, "jura")) %>%
group_by(id) %>%
summarise(text = paste(word, collapse = " "))

#     id                              text
#  <int>                             <chr>
#1     1  tablet cleaning hydraulic system
#2     2 tablet cleaning mambojumbo system
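
If you want to use your own stop-word file (the mystopwords.txt mentioned in the question) instead of, or in addition to, the bundled stop_words data, you could read it in and pass the combined vector to filter(). A minimal sketch, assuming the file holds one stop word per line with no header (the file name and format are assumptions):

# Assumption: mystopwords.txt contains one stop word per line, no header
mystopwords <- readLines("mystopwords.txt")

mutate(mydata, text = gsub(x = text, pattern = "[0-9]+|[[:punct:]]|\\(.*\\)", replacement = "")) %>%
unnest_tokens(input = text, output = word) %>%
filter(!word %in% c(stop_words$word, tolower(mystopwords))) %>%  # lower-case to match unnest_tokens() output
group_by(id) %>%
summarise(text = paste(word, collapse = " "))

The tolower() call matters because unnest_tokens() lower-cases every token, so mixed-case entries such as Jura in your file would otherwise never match.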

Another way would be the following. In this case, I am not using unnest_tokens().

library(magrittr)
library(stringi)
library(tidytext)

data(stop_words)

gsub(x = mydata$text, pattern = "[0-9]+|[[:punct:]]|\\(.*\\)", replacement = "") %>%
stri_split_regex(str = ., pattern = " ", omit_empty = TRUE) %>%
lapply(function(x){
    foo <- x[which(!x %in% c(stop_words$word, "Jura"))] %>%
           paste(collapse = " ")
    foo}) %>%
unlist

#[1] "Tablet cleaning hydraulic system"  "Tablet cleaning mambojumbo system"
2
votes

There are multiple ways of doing this. If you want to rely on base R only, you can transform @jazzurro's answer a bit and use gsub() to find and replace the text patterns you want to delete.

I'll do this by using two regular expressions: the first one matches the content of the brackets and numeric values, whereas the second one will remove the stop words. The second regex will have to be constructed based on the stop words you want to remove. If we put it all in a function, you can easily apply it to all your strings using sapply:

mytextdata <- read.csv("123.csv", header=FALSE, stringsAsFactors=FALSE)

custom_filter <- function(string, stopwords=c()){
  string <- gsub("[-0-9]+|\\(.*\\) ", "", string)
  # Create something like:  "\\b( the|Jura)\\b"
  new_regex <- paste0("\\b( ", paste0(stopwords, collapse="|"), ")\\b")
  gsub(new_regex, "", string)
}

stopwords <- c("the", "Jura")
custom_filter(mytextdata[[1]], stopwords)
# [1] "Tablet for cleaning hydraulic system  "
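
Since gsub() is vectorized, custom_filter() can also take a whole character vector at once; alternatively, as mentioned above, you can map it over every row with sapply(). A sketch, assuming the text sits in the first column of mytextdata (read.csv with header=FALSE names it V1):

# Apply the filter to each string in the first column;
# unname() drops the names that sapply() attaches
cleaned <- unname(sapply(mytextdata[[1]], custom_filter, stopwords = stopwords))

If the leftover double and trailing spaces in the result bother you, a follow-up pass such as trimws(gsub("\\s+", " ", cleaned)) would collapse them.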