0
votes

I want to remove punctuation, numbers, and http links from the text in a data.frame. I tried the tm, stringr, quanteda, and tidytext packages, but none of them worked. I'm looking for a simple package or function that cleans a data.frame column without converting it to a corpus or something like that.

How can I do it?

mycorpus <- tm_map(mycorpus, content_transformer(remove_url))
Warning message:
In tm_map.SimpleCorpus(mycorpus, content_transformer(remove_url)) :
  transformation drops documents

mycorpus <- tm_map(mycorpus, removePunctuation)
Warning message:
In tm_map.SimpleCorpus(mycorpus, removePunctuation) :
  transformation drops documents

And when I try to view tweets that contain any symbol:

Error in nchar(output) : invalid multibyte string, element 1

mycorpus <- tm_map(mycorpus, content_transformer(tolower))
Error in FUN(content(x), ...) : invalid input

2
What exactly have you tried? Please see here on making an R post we can help with. That includes a representative sample of data, code that hasn't worked, and expected output. - camille
Welcome to SO. It is always recommended to post samples of input and expected output in your post, with code tags. - RavinderSingh13
Please provide a shortened example of your data we can work with. Otherwise we have to keep guessing. - Manuel Bickel
You might take another look at unnest_tokens from tidytext, which now has a token = "tweets" option that may be a good fit for you. It has options including strip_punct = TRUE and strip_url = TRUE. - Julia Silge

2 Answers

3
votes

Since you haven't posted any sample input or expected output, I couldn't test this, but to remove punctuation, digits, and http links from a specific column of your data frame you could try the following.

gsub("[[:punct:]]|[[:digit:]]|^http:\\/\\/.*|^https:\\/\\/.*","",df$column)

Or, as per Rui's suggestion in the comments, you could use the following instead.

gsub("[[:punct:]]|[[:digit:]]|(http[[:alpha:]]*:\\/\\/)","",df$column)
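One thing to watch with single-pass patterns like these: punctuation is stripped in the same pass as the link, so for a link in the middle of a tweet some letters of the URL can be left behind. A minimal sketch that removes whole links first and only then punctuation and digits is shown here. The data frame `df`, the column `text`, the helper `clean_text`, and the sample tweets are all hypothetical, and the `iconv` step assumes the text is meant to be UTF-8 (it is included because of the "invalid multibyte string" error in the question and may behave differently across platforms).

```r
# Hypothetical sample data; `df` and `text` are placeholder names
df <- data.frame(
  id   = 1:2,
  text = c("Great read https://t.co/abc123 thanks!", "Call 911, now."),
  stringsAsFactors = FALSE
)

# Hypothetical helper, not part of any package
clean_text <- function(x) {
  # Assumption: the text is meant to be UTF-8; sub = "" drops bytes that cannot be converted
  x <- iconv(x, from = "UTF-8", to = "UTF-8", sub = "")
  # Remove whole links first, while "://" and "." are still there to match on
  x <- gsub("https?://\\S+", "", x, perl = TRUE)
  # Then remove punctuation and digits
  gsub("[[:punct:]]|[[:digit:]]", "", x, perl = TRUE)
}

df$text <- clean_text(df$text)
```

Removed tokens leave their surrounding blanks behind, so the cleaned text can contain double spaces.
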
0
votes

A more concise version is possible if you aim to keep only letters, by replacing everything that is not a letter. I assume you want to replace the matches with a blank, because you mentioned a corpus. Otherwise the words in your addresses will be collapsed into one long string (but maybe that is what you want; as stated, you might provide an example).

x = c("https://stackoverflow.com/questions/51582369/how-can-i-remove-punctuations-and-numbers-in-text-from-data-frame-file-in-r"
      , "http://stackoverflow.com/questions/51582369/how-can-i-remove-punctuations-and-numbers-in-text-from-data-frame-file-in-r")

gsub("\\W|\\d|http\\w?", " ", x, perl = TRUE)
# [1] "    stackoverflow com questions          how can i remove punctuations and numbers in text from data frame file in r"
# [2] "    stackoverflow com questions          how can i remove punctuations and numbers in text from data frame file in r"

# the same task for a data frame of 100000 rows
# make sure that your strings are not factors
df = data.frame(id = 1:1e5, url = rep(x, 1e5/2), stringsAsFactors = FALSE)
# df before replacement
df[1:4, ]
# id    url
# 1  1 https://stackoverflow.com/questions/51582369/how-can-i-remove-punctuations-and-numbers-in-text-from-data-frame-file-in-r
# 2  2  http://stackoverflow.com/questions/51582369/how-can-i-remove-punctuations-and-numbers-in-text-from-data-frame-file-in-r
# 3  3 https://stackoverflow.com/questions/51582369/how-can-i-remove-punctuations-and-numbers-in-text-from-data-frame-file-in-r
# 4  4  http://stackoverflow.com/questions/51582369/how-can-i-remove-punctuations-and-numbers-in-text-from-data-frame-file-in-r
# apply replacement on a specific column and assign result back to this column
df$url = gsub("\\W|\\d|http\\w?", " ", df$url, perl = TRUE)
# check output
df[1:4, ]
# id        url
# 1  1     stackoverflow com questions          how can i remove punctuations and numbers in text from data frame file in r
# 2  2     stackoverflow com questions          how can i remove punctuations and numbers in text from data frame file in r
# 3  3     stackoverflow com questions          how can i remove punctuations and numbers in text from data frame file in r
# 4  4     stackoverflow com questions          how can i remove punctuations and numbers in text from data frame file in r
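
Replacing every match with a blank leaves runs of spaces, as the output above shows. If single spaces are preferred, a second pass can squeeze them. This is only a sketch: the sample URL and the names `cleaned` and `tidy` are made up for illustration.

```r
# Hypothetical input; any character vector works the same way
x <- "http://example.com/page-1?id=42"

# Same replacement as above: non-word characters, digits and the http(s) prefix become blanks
cleaned <- gsub("\\W|\\d|http\\w?", " ", x, perl = TRUE)

# Collapse runs of whitespace into one space, then trim both ends
tidy <- trimws(gsub("\\s+", " ", cleaned))
```

Note that `\\W` treats the underscore as a word character, so underscores survive this replacement.
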