0
votes

Editing the question as directed by Tyler in the comments below.

As part of a larger text mining project, I have created a .csv file which has titles of books in the first column and the whole contents of each book in the second column. My goal is to create a word cloud consisting of the top n (n = 100, 200 or 1000, depending on how skewed the scores turn out to be) most frequently repeated words in the text for each title, after removing the common English stop words (for which the R tm (text mining) package has a handy function, removeWords, used together with stopwords("english")). Hope this explains my problem better.

Problem statement:

My input is in the below format in a csv file:

title   text
1   <huge amount of text1>
2   <huge amount of text2>
3   <huge amount of text3>

Here's a MWE with similar data:

library(tm)
data(acq)
dat <- data.frame(title=names(acq[1:3]), text=unlist(acq[1:3]), row.names=NULL)

I would like to find out the top "n" terms by frequency appearing in the corresponding text for each title excluding the stop words. The ideal output would be a table in excel or csv that would look like:

title   term    frequency
1       ..       ..
1       ..       ..
1       
1       
1       
2       
2       
2       
2       
2       
3       
3       
3       ..      ..

Please advise whether this can be accomplished in R or Python.

5
You have been rather lazy asking what could have been a good question and the consequence is closing of the question. If I asked a mechanic to work on my car I wouldn't say "Imagine an engine right here, and some tires here..." Please spend some time making this reproducible as stated in the help guide. I'd certainly vote to reopen if you gave a reproducible example and cleaned up a bit. - Tyler Rinker
@TylerRinker Thanks for being supportive and for the help guides; being a noob, these definitely helped. I've edited the question as suggested. Hope what I'm trying to achieve is clear enough now. - koder
@neo question is clearer now (it wasn't that bad before). The problem is your data isn't data. You're using <huge amount of text1> here and .. to represent data (though the .. isn't that bad, because that's the desired output, not the data we need to solve the problem). I'll edit to give you reproducible data. - Tyler Rinker

5 Answers

3
votes

In Python, you can use Counter from the collections module, and re to split the sentence at each word, giving you this:

>>> import re
>>> from collections import Counter
>>> t = "This is a sentence with many words. Some words are repeated"
>>> Counter(re.split(r'\W+', t)).most_common()
[('words', 2), ('This', 1), ('is', 1), ('a', 1), ('sentence', 1), ('with', 1), ('many', 1), ('Some', 1), ('are', 1), ('repeated', 1)]
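
To get closer to the table asked for in the question (one row per title and term), the same Counter idea can be applied to each text separately. The following is only a minimal sketch: the stop word set, the sample `books` data and the helper names `top_terms` and `top_terms_table` are assumptions made for illustration, not part of the original answer.

```python
import csv
import io
import re
from collections import Counter

# Tiny illustrative stop word list (an assumption); swap in a fuller one.
STOP_WORDS = {"a", "an", "and", "are", "in", "is", "of", "some",
              "the", "this", "to", "with"}

def top_terms(text, n, stop_words=STOP_WORDS):
    """Return the n most frequent (term, count) pairs, ignoring stop words."""
    words = re.findall(r"\w+", text.lower())
    counts = Counter(w for w in words if w not in stop_words)
    return counts.most_common(n)

def top_terms_table(rows, n):
    """Turn (title, text) pairs into (title, term, frequency) rows."""
    return [(title, term, freq)
            for title, text in rows
            for term, freq in top_terms(text, n)]

# Hypothetical sample data standing in for the real csv contents.
books = [
    (1, "This is a sentence with many words. Some words are repeated"),
    (2, "The cat and the hat; the cat sat"),
]
table = top_terms_table(books, n=2)

# Write the long-format table as csv (to a string here; a file works the same way).
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["title", "term", "frequency"])
writer.writerows(table)
print(buffer.getvalue())
```

For real data, the `books` list would be replaced by rows read from the input file with `csv.reader`, and `buffer` by an open output file.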
0
votes

In R:

n <- 100  # example value: number of top-ranked terms to keep per title
dat <- read.csv("myFile", stringsAsFactors = FALSE)  # strsplit() needs character, not factor
splitPerRow <- strsplit(dat$text, "\\W")
tablePerRow <- lapply(splitPerRow, table)
tablePerRow <- lapply(tablePerRow, sort, decreasing = TRUE)
tablePerRow <- lapply(tablePerRow, head, n) # set n to be the threshold on frequency rank

output <- data.frame(freq = unlist(tablePerRow),
                     title = rep(dat$title, times = sapply(tablePerRow, length)),
                     term = unlist(lapply(tablePerRow, names)),
                     row.names = NULL)

Depending on the nature of the text, you might need to filter out non-word entries: if the text is "term1 term2, term3", you'll get an empty entry caused by the empty string between the comma and the space after term2.
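
For comparison, the same empty-entry effect can be reproduced in Python, where `re.split` on a single non-word character behaves much like the `strsplit` call above. This is just an illustrative sketch using the example string from the previous paragraph; the variable names are made up.

```python
import re

text = "term1 term2, term3"

# Splitting on every single non-word character leaves an empty string
# between the comma and the space that follows it.
raw = re.split(r"\W", text)

# Dropping empty strings afterwards removes those artefacts.
terms = [t for t in raw if t]

print(raw)
print(terms)
```

Filtering after the split keeps the splitting rule simple; splitting on runs of non-word characters is another option.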

0
votes

In base R:


## set up some data
words <- paste(LETTERS[1:3], letters[1:3], sep = "")
dat <- data.frame(title = 1:3, text = sapply(1:3, function(x){
  paste(sample(unlist(strsplit(words, " ")), 15, TRUE), collapse = " ")
  }))
dat$text <- as.character(dat$text)

## solve the problem
tabs <- sapply(dat$text, function(x){
    table(unlist(strsplit(x, " ")))
    }, USE.NAMES = FALSE)
data.frame(title = sort(rep(1:nrow(dat), 3)),
           text = sort(rep(rownames(tabs))),
           freq = c(tabs))

## title text freq
##     1   Aa    6
##     1   Bb    3
##     1   Cc    6
##     2   Aa    9
##     2   Bb    4
##     2   Cc    2
##     3   Aa    4
##     3   Bb    7
##     3   Cc    4
0
votes

This allows you to do what you're after:

library(qdap)
list_df2df(setNames(lapply(dat$text, freq_terms, top=10, 
    stopwords = Dolch), dat$title), "Title")

freq_terms removes the stop words and returns the top n terms; here it is applied to each text separately. setNames then attaches the titles, and list_df2df puts everything together into one data frame.

Here I use the qdapDictionaries::Dolch list for stop words, but you can use whatever vector you want. Also note that if there's a tie for the top ten words, all words at that level will be included.

##              Title           WORD FREQ
## 1   reut-00001.xml       computer    6
## 2   reut-00001.xml        company    4
## 3   reut-00001.xml           dlrs    4
## .
## .
## .
## .
## 112 reut-00003.xml        various    1
## 113 reut-00003.xml           week    1
## 114 reut-00003.xml         within    1
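
The tie behaviour described above (every word tied with the n-th one is kept, so more than n rows can come back) can be sketched in Python as follows. This is not how qdap implements it; `top_n_with_ties` and the sample word list are hypothetical and only illustrate the idea.

```python
from collections import Counter

def top_n_with_ties(counts, n):
    """Return (term, count) pairs for the top n terms, plus any terms tied with the n-th."""
    if n <= 0:
        return []
    ranked = counts.most_common()
    if len(ranked) <= n:
        return ranked
    cutoff = ranked[n - 1][1]  # count of the n-th most frequent term
    return [(term, c) for term, c in ranked if c >= cutoff]

# Made-up word list, loosely modelled on the terms in the output above.
counts = Counter(
    "computer company dlrs computer company computer week within".split()
)
print(top_n_with_ties(counts, 2))
print(top_n_with_ties(counts, 3))
```

With n = 3 the third rank is shared by three words that each occur once, so all of them are returned, which is why the output above has more than ten rows per title.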
0
votes

In R you can use the stringi package and its stri_extract_all_charclass function to extract all runs of letters (i.e. words) from the text:

library(stringi)
stri_extract_all_charclass(c("Ala ma; kota. Jaś nie ma go\n.To nic nie ma 123","abc dce"),"\\p{Lc}")
## [[1]]
## [1] "Ala"  "ma"   "kota" "Jaś"  "nie"  "ma"   "go"   "To"   "nic"  "nie"  "ma"  
## 
## [[2]]
## [1] "abc" "dce"

Then you can count these words using the table function. You may also want to convert every word to lowercase first, using the stri_trans_tolower function.

stri_extract_all_charclass(c("Ala ma; kota. Jaś nie ma go\n.To nic nie ma 123","abc dce"),"\\p{Lc}") -> temp
lapply(temp, table)
## [[1]]
## 
##  Ala   go  Jaś kota   ma  nic  nie   To 
##    1    1    1    1    3    1    2    1 
## 
## [[2]]
## 
## abc dce 
##   1   1
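
A rough Python counterpart of the two steps above (extract letter-only words, lowercase them, then count) might look like this. It is a sketch under the assumption that "word" means a run of Unicode letters; `letter_word_counts` is a made-up helper name.

```python
import re
from collections import Counter

def letter_word_counts(text):
    """Count lowercase words made only of letters (digits and punctuation are skipped)."""
    # [^\W\d_] matches a word character that is neither a digit nor an underscore,
    # which for Python 3 str patterns means a Unicode letter.
    words = re.findall(r"[^\W\d_]+", text.lower())
    return Counter(words)

counts = letter_word_counts("Ala ma; kota. Jaś nie ma go\n.To nic nie ma 123")
print(counts.most_common(3))
```

Lowercasing before counting merges variants such as "To" and "to" into a single term.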