Here is a Python version:

    from collections import Counter

    wc = Counter()
    with open('tale.txt') as f:
        for line in f:
            # split() with no arguments splits on any run of whitespace
            wc.update(line.split())

    print(wc.most_common(50))
Running that on A Tale of Two Cities yields:
[('the', 7514), ('and', 4745), ('of', 4066), ('to', 3458), ('a', 2825), ('in', 2447), ('his', 1911), ('was', 1673), ('that', 1663), ('I', 1446), ('he', 1388), ('with', 1288), ('had', 1263), ('it', 1173), ('as', 1016), ('at', 978), ('you', 895), ('for', 868), ('on', 820), ('her', 818), ('not', 748), ('is', 713), ('have', 703), ('be', 701), ('were', 633), ('Mr.', 602), ('The', 587), ('said', 570), ('my', 568), ('by', 547), ('him', 525), ('from', 505), ('this', 465), ('all', 459), ('they', 446), ('no', 423), ('so', 420), ('or', 418), ('been', 415), ('"I', 400), ('but', 387), ('which', 375), ('He', 363), ('when', 354), ('an', 337), ('one', 334), ('out', 333), ('who', 331), ('if', 327), ('would', 327)]
You can also come up with a modular, Unix-style solution with awk, sort and head:
$ awk '{for (i=1;i<=NF; i++){words[$i]++}}END{for (w in words) print words[w]"\t"w}' tale.txt | sort -n -r | head -n 50
7514 the
4745 and
4066 of
3458 to
2825 a
2447 in
...
No matter the language, the recipe is the same:
- Create an associative array of words and their frequency count
- Read the file line by line and add to the associative array word by word
- Sort the array by frequency and print the desired number of entries.
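The three steps above can be sketched with a plain dictionary instead of Counter, to make each step explicit. The function name top_words and the sample input are made up for illustration; this is a minimal sketch, not a replacement for the Counter version.

```python
def top_words(lines, n):
    """Return the n most frequent whitespace-separated words in lines."""
    counts = {}  # step 1: associative array mapping word -> frequency
    for line in lines:  # step 2: read line by line...
        for word in line.split():  # ...and add word by word
            counts[word] = counts.get(word, 0) + 1
    # step 3: sort by frequency, highest first, and keep n entries
    ranked = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)
    return ranked[:n]


sample = ["the cat and the cat", "the end"]
print(top_words(sample, 2))  # [('the', 3), ('cat', 2)]
```

Any iterable of lines works here, so an open file object can be passed in place of the sample list.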
You also need to think about what a 'word' is. In this case, I have simply used whitespace as the delimiter, so every block of non-space characters counts as a 'word'. That means that And, and, and "And are all different words. Separating punctuation is an additional step, usually involving a regular expression.
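As one possible way to handle that extra step, the sketch below extracts words with a regular expression and lowercases them before counting. The pattern is an assumption about what should count as a word (runs of ASCII letters, with internal apostrophes allowed); other texts may need a different definition. The names WORD_RE and count_words are illustrative.

```python
import re
from collections import Counter

# Assumption: a word is a run of ASCII letters, optionally joined by
# internal apostrophes (so "didn't" stays one word).
WORD_RE = re.compile(r"[A-Za-z]+(?:'[A-Za-z]+)*")


def count_words(text):
    """Count case-folded words, ignoring surrounding punctuation."""
    return Counter(word.lower() for word in WORD_RE.findall(text))


counts = count_words('And then, "And so," he said, and didn\'t stop.')
print(counts["and"])  # 3: And, "And and and now count as the same word
```

With this definition, tokens such as Mr. and "I from the output above would be counted as mr and i.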
tr translates the complement (-c) of alphanumeric characters to newlines. sort brings like words together, then uniq -c produces one line for each different word with a count. sort -nr then sorts the counts numerically, largest to smallest, and head -10 gives the first 10 lines. In Unix variants, including Linux and Cygwin, the man command (for manual) gives reference for each command. Thus, man tr would give the manual page for tr, etc. - mpez0
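For comparison, the stages described in that comment can be mirrored one by one in Python. The exact command line is not shown here, so the character class (ASCII letters and digits) is an assumption, and the function name top_counts is made up for illustration.

```python
import re
from itertools import groupby


def top_counts(text, n=10):
    """Mimic a tr | sort | uniq -c | sort -nr | head pipeline on a string."""
    # tr -c: turn every non-alphanumeric character into a newline;
    # split() then drops the empty entries left by repeated newlines
    words = re.sub(r"[^A-Za-z0-9]", "\n", text).split()
    # sort: bring identical words next to each other
    words.sort()
    # uniq -c: one (count, word) pair per run of identical words
    counted = [(len(list(group)), word) for word, group in groupby(words)]
    # sort -nr: largest counts first
    counted.sort(key=lambda pair: pair[0], reverse=True)
    # head -n: keep the first n entries
    return counted[:n]


print(top_counts("a b a. c, a b", n=2))  # [(3, 'a'), (2, 'b')]
```

As with uniq, groupby only merges adjacent duplicates, which is why the sort has to come first.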