3
votes

I have a text corpus.

mytextdata <- read.csv("path/to/texts.csv", stringsAsFactors = FALSE)
Mystopwords <- read.csv("path/to/mystopwords.txt", stringsAsFactors = FALSE)

How can I filter this text? I must delete:

1) all numbers

2) the stop words

3) the brackets

I will not work with a DTM; I just need to clean this text data of numbers and stop words.

sample data:

112773-Tablet for cleaning the hydraulic system Jura (6 pcs.) 62715

Jura and the are stop words.

In the output I expect:

  Tablet for cleaning hydraulic system 
Would you provide sample data? – jazzurro
And the code you have tried so far, along with your expected result. This is a common text mining task; perhaps searching resources like library(tidytext) would get you going. – Nate
@jazzurro, I edited the post with sample data. – D.Joe
@Nate, I edited the post with what I expect. – D.Joe
Do you need to remove anything in ()? – jazzurro

2 Answers

5
votes

Since there is only one character string available in the question at the moment, I decided to create sample data myself. I hope this is close to your actual data. As Nate suggested, using the tidytext package is one way to go. Here, I first removed numbers, punctuation, contents in the brackets, and the brackets themselves. Then I split the words in each string using unnest_tokens() and removed stop words. Since you have your own stop words, you may want to create your own dictionary; I simply added jura in the filter() part. Finally, grouping the data by id, I combined the words in summarise() to re-create one character string per row. Note that I used jura instead of Jura: this is because unnest_tokens() converts capital letters to lower case.

mydata <- data.frame(id = 1:2,
                     text = c("112773-Tablet for cleaning the hydraulic system Jura (6 pcs.) 62715",
                              "1234567-Tablet for cleaning the mambojumbo system Jura (12 pcs.) 654321"),
                     stringsAsFactors = FALSE)

library(dplyr)
library(tidytext)

data(stop_words)

mutate(mydata, text = gsub(x = text, pattern = "[0-9]+|[[:punct:]]|\\(.*\\)", replacement = "")) %>%
unnest_tokens(input = text, output = word) %>%
filter(!word %in% c(stop_words$word, "jura")) %>%
group_by(id) %>%
summarise(text = paste(word, collapse = " "))

#     id                              text
#  <int>                             <chr>
#1     1  tablet cleaning hydraulic system
#2     2 tablet cleaning mambojumbo system
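
If you want to use your own stop-word file (the mystopwords.txt mentioned in the question) instead of, or in addition to, the bundled stop_words data, you could read it in and pass the combined vector to filter(). A minimal sketch, assuming the file holds one stop word per line with no header (the file name and format are assumptions):

# Assumption: mystopwords.txt contains one stop word per line, no header
mystopwords <- readLines("mystopwords.txt")

mutate(mydata, text = gsub(x = text, pattern = "[0-9]+|[[:punct:]]|\\(.*\\)", replacement = "")) %>%
unnest_tokens(input = text, output = word) %>%
filter(!word %in% c(stop_words$word, tolower(mystopwords))) %>%  # lower-case to match unnest_tokens() output
group_by(id) %>%
summarise(text = paste(word, collapse = " "))

The tolower() call matters because unnest_tokens() lower-cases every token, so mixed-case entries such as Jura in your file would otherwise never match.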

Another way would be the following. In this case, I am not using unnest_tokens().

library(magrittr)
library(stringi)
library(tidytext)

data(stop_words)

gsub(x = mydata$text, pattern = "[0-9]+|[[:punct:]]|\\(.*\\)", replacement = "") %>%
stri_split_regex(str = ., pattern = " ", omit_empty = TRUE) %>%
lapply(function(x){
    foo <- x[which(!x %in% c(stop_words$word, "Jura"))] %>%
           paste(collapse = " ")
    foo}) %>%
unlist

#[1] "Tablet cleaning hydraulic system"  "Tablet cleaning mambojumbo system"
2
votes

There are multiple ways of doing this. If you want to rely on base R only, you can transform @jazzurro's answer a bit and use gsub() to find and replace the text patterns you want to delete.

I'll do this by using two regular expressions: the first one matches the content of the brackets and numeric values, whereas the second one will remove the stop words. The second regex will have to be constructed based on the stop words you want to remove. If we put it all in a function, you can easily apply it to all your strings using sapply:

mytextdata <- read.csv("123.csv", header=FALSE, stringsAsFactors=FALSE)

custom_filter <- function(string, stopwords=c()){
  string <- gsub("[-0-9]+|\\(.*\\) ", "", string)
  # Create something like:  "\\b( the|Jura)\\b"
  new_regex <- paste0("\\b( ", paste0(stopwords, collapse="|"), ")\\b")
  gsub(new_regex, "", string)
}

stopwords <- c("the", "Jura")
custom_filter(mytextdata[[1]], stopwords)
# [1] "Tablet for cleaning hydraulic system  "
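
Since gsub() is vectorized, custom_filter() can also take a whole character vector at once; alternatively, as mentioned above, you can map it over every row with sapply(). A sketch, assuming the text sits in the first column of mytextdata (read.csv with header=FALSE names it V1):

# Apply the filter to each string in the first column;
# unname() drops the names that sapply() attaches
cleaned <- unname(sapply(mytextdata[[1]], custom_filter, stopwords = stopwords))

If the leftover double and trailing spaces in the result bother you, a follow-up pass such as trimws(gsub("\\s+", " ", cleaned)) would collapse them.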