Editing the question as directed by Tyler in the comments below.
As part of a larger text mining project, I have created a .csv file with book titles in the first column and the full contents of each book in the second column. My goal is to create a word cloud of the top n most frequent words in the text for each title (n = 100, 200 or 1000, depending on how skewed the scores turn out to be), after removing the common English stop words. The R tm (text mining) package handles this nicely with removeWords and stopwords("english"). I hope this explains my problem better.
Problem statement:
My input is a csv file in the following format:
title text
1 <huge amount of text1>
2 <huge amount of text2>
3 <huge amount of text3>
Here's a MWE with similar data:
library(tm)
data(acq)
dat <- data.frame(title=names(acq[1:3]), text=unlist(acq[1:3]), row.names=NULL)
I would like to find the top "n" terms by frequency in the corresponding text for each title, excluding stop words. The ideal output would be a table in Excel or csv, with one row per title and term, that looks like:
title term frequency
1 .. ..
1 .. ..
1
1
1
2
2
2
2
2
3
3
3 .. ..
Could someone please advise whether this can be accomplished in R or Python?
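For the Python side, the desired long-format table could be sketched roughly as below, using only the standard library. This is a minimal illustration, not a definitive implementation: the names top_terms, term_table and write_term_table, the tiny STOP_WORDS set and the two sample rows are all made up for the example. A real run would use a fuller stop-word list (for instance the one tm returns from stopwords("english")) and the actual title/text rows from the csv.

```python
import csv
import re
from collections import Counter

# Small illustrative stop-word set (an assumption for this sketch); a real
# run would substitute a fuller English stop-word list.
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "in", "is", "it",
              "for", "on", "that", "with", "as", "was"}


def top_terms(text, n, stop_words=STOP_WORDS):
    """Return the n most frequent non-stop-word terms as (term, count) pairs."""
    # Lower-case, then keep runs of letters (allowing one inner apostrophe).
    words = re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())
    counts = Counter(w for w in words if w not in stop_words)
    return counts.most_common(n)


def term_table(rows, n):
    """Build long-format (title, term, frequency) rows from (title, text) pairs."""
    return [(title, term, freq)
            for title, text in rows
            for term, freq in top_terms(text, n)]


def write_term_table(table, handle):
    """Write the table as csv with a title,term,frequency header."""
    writer = csv.writer(handle)
    writer.writerow(["title", "term", "frequency"])
    writer.writerows(table)


# Hypothetical sample rows standing in for the real (title, text) data.
sample = [
    ("1", "The cat sat on the mat. The cat sat. The cat slept."),
    ("2", "A dog and a dog and a bird."),
]
table = term_table(sample, n=2)
print(table)
# [('1', 'cat', 3), ('1', 'sat', 2), ('2', 'dog', 2), ('2', 'bird', 1)]
```

To get the csv output, open a file with newline="" and pass the handle to write_term_table. If the real rows are read with the csv module, note that its default field size limit is 131072 characters, so book-length text fields will likely need csv.field_size_limit raised first.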
<huge amount of text1> here and .. to represent data (though the .. isn't that bad, because that's the desired output, not the data we need to solve the problem). I'll edit to give you reproducible data. - Tyler Rinker