Calculating word frequencies in Python or R

Question

Editing the question as directed by Tyler in the comments below.

As part of a larger text mining project, I have created a .csv file with book titles in the first column and the full contents of each book in the second column. My goal is to create a word cloud of the top n most frequently repeated words in the text for each title (n = 100, 200, or 1000, depending on how skewed the scores turn out to be), after removing common English stop words (for which the R tm (text mining) package has a beautiful function, removeWords, used with stopwords("english")). Hope this explains my problem better.

Problem statement:

My input is in the below format in a csv file:

title   text
1   <huge amount of text1>
2   <huge amount of text2>
3   <huge amount of text3>

Here's an MWE with similar data:

library(tm)
data(acq)
dat <- data.frame(title=names(acq[1:3]), text=unlist(acq[1:3]), row.names=NULL)

I would like to find the top "n" terms by frequency in the corresponding text for each title, excluding stop words. The ideal output would be a table in Excel or CSV that looks like:

title   term    frequency
1       ..       ..
1       ..       ..
1       
1       
1       
2       
2       
2       
2       
2       
3       
3       
3       ..      ..

Please advise whether this can be accomplished in R or Python. Anyone, please?

You have been rather lazy asking what could have been a good question and the consequence is closing of the question. If I asked a mechanic to work on my car I wouldn't say "Imagine an engine right here, and some tires here..." Please spend some time making this reproducible as stated in the help guide. I'd certainly vote to reopen if you gave a reproducible example and cleaned up a bit. — Tyler Rinker
@TylerRinker Thanks for being supportive and for the help guides; being a noob, these definitely helped. I've edited the question as suggested. Hope what I'm trying to achieve is clear enough now. — koder
@neo The question is clearer now (it wasn't that bad before). The problem is that your data isn't data. You're using <huge amount of text1> here and .. to represent data (though the .. isn't that bad, because that's the desired output, not the data we need to solve the problem). I'll edit to give you reproducible data. — Tyler Rinker

Burhan Khalid · Accepted Answer · 2014-03-21T09:24:06

In Python, you can use Counter from the collections module, and re to split the sentence on non-word characters, giving you this:

>>> import re
>>> from collections import Counter
>>> t = "This is a sentence with many words. Some words are repeated"
>>> Counter(re.split(r'\W', t)).most_common()
[('words', 2), ('a', 1), ('', 1), ('sentence', 1), ('This', 1), ('many', 1), ('is', 1), ('Some', 1), ('repeated', 1), ('are', 1), ('with', 1)]
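Note that the output above includes an empty string, because splitting on \W leaves an empty token between the period and the following space. To get closer to the per-title table the question asks for, here is a minimal sketch extending the same Counter idea. It is only one possible approach: the stop-word set, the function names top_terms and frequency_rows, and the in-memory sample CSV are all made up for illustration and are not part of the original answer. It uses re.findall(r"\w+", ...) instead of re.split so that no empty tokens are produced, and lowercases the text so that "This" and "this" count as the same term.

```python
import csv
import io
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption); substitute a fuller list in practice.
STOP_WORDS = {"a", "an", "and", "are", "is", "of", "the", "this",
              "with", "some", "many"}


def top_terms(text, n, stop_words=STOP_WORDS):
    """Return the n most frequent non-stop-word terms as (term, count) pairs."""
    # findall on \w+ yields only non-empty word tokens
    words = re.findall(r"\w+", text.lower())
    counts = Counter(w for w in words if w not in stop_words)
    return counts.most_common(n)


def frequency_rows(rows, n):
    """Yield (title, term, frequency) for each title's top n terms."""
    for row in rows:
        for term, freq in top_terms(row["text"], n):
            yield row["title"], term, freq


# Hypothetical stand-in for the real CSV file, with "title" and "text" columns.
sample_csv = io.StringIO(
    "title,text\n"
    "1,This is a sentence with many words. Some words are repeated\n"
    "2,The cat and the hat. The cat sat\n"
)
result = list(frequency_rows(csv.DictReader(sample_csv), n=2))

# Write the long-format table (title, term, frequency) as CSV text.
out = io.StringIO()
writer = csv.writer(out)
writer.writerow(["title", "term", "frequency"])
writer.writerows(result)
print(out.getvalue())
```

With a real file, you would replace sample_csv with open("yourfile.csv", newline="") (the filename is a placeholder) and write to a file instead of a StringIO. For book-length text fields, the csv module's default field size limit may be too small; csv.field_size_limit can raise it.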
