0
votes

I'm not very Linux-savvy, and I have a very large text file (a couple of gigs). I would like to find the most frequent words (say the top 50) and a count of how many times each word occurs, and have these figures output to a text file, something like so:

2500 and

How can I do this using Awk? (It does not specifically have to be using Awk, but I'm using Cygwin on Windows 7 and I'm not sure what other things are available to do this sort of thing).

I have taken a look at this question: https://unix.stackexchange.com/questions/41479/find-n-most-frequent-words-in-a-file

As previously stated, though, I'm not too familiar with Linux, piping, etc., so I would appreciate it if someone could explain what each command does.

3
tr translates the complement (-c) of alphanumeric characters to newlines. sort brings like words together, then uniq -c produces one line for each different word with a count. sort -nr then sorts the counts numerically, largest to smallest, and head -10 gives the first 10 lines. In Unix variants, including Linux and Cygwin, the man command (for manual) gives reference for each command. Thus, man tr would give the manual page for tr, etc. - mpez0
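
Putting those pieces together, the pipeline from the linked question would look roughly like this (a sketch, not a tested command; bigfile.txt and top50.txt are placeholder names, and head -50 keeps the top 50 rather than the top 10):

# turn every run of non-alphanumeric characters into a newline, group identical
# words, count them, sort by count (largest first) and keep the top 50
tr -cs '[:alnum:]' '\n' < bigfile.txt | sort | uniq -c | sort -nr | head -50 > top50.txt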

3 Answers

1
votes

It all depends on your definition of a "word", but if we assume it's a contiguous sequence of alphabetic characters, compared case-insensitively, then one approach with GNU awk (which is the awk you get with Cygwin) would be:

awk -v RS='[[:alpha:]]+' '
    # RT holds the text that matched RS, i.e. the current word (GNU awk)
    RT { cnt[tolower(RT)]++ }
    END {
        # iterate over the counts by value, numeric, descending (GNU awk)
        PROCINFO["sorted_in"] = "@val_num_desc"
        for (word in cnt) {
            print cnt[word], word
            if (++c == 50) {
                exit
            }
        }
    }
' file

When run on @dawg's Tale of Two Cities example, the above outputs:

8230 the
5067 and
4140 of
3651 to
3017 a
2660 in
...
440 when
440 been
428 which
399 them
385 what

Want to exclude the 1- or 2-character filler words like of, to, a, and in from the output above? Easy:

awk -v RS='[[:alpha:]]+' '
    length(RT)>2 { cnt[tolower(RT)]++ }
    END {
        PROCINFO["sorted_in"] = "@val_num_desc"
        for (word in cnt) {
            print cnt[word], word
            if (++c == 50) {
                exit
            }
        }
    }
' pg98.txt

which outputs:
8230 the
5067 and
2011 his
1956 that
1774 was
1497 you
1358 with
....

With other awks, it'd be a while(match())/substr() loop with the output piped to sort -n and then head.
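
A rough, untested sketch of that portable variant (same definition of a word as above, should work in any POSIX awk) might look like:

awk '
    {
        line = tolower($0)
        # peel off one alphabetic run at a time; match() sets RSTART and RLENGTH
        while (match(line, /[[:alpha:]]+/)) {
            cnt[substr(line, RSTART, RLENGTH)]++
            line = substr(line, RSTART + RLENGTH)
        }
    }
    END {
        for (word in cnt) {
            print cnt[word], word
        }
    }
' file | sort -rn | head -n 50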

If that's not what you want then edit your question to include some sample input and expected output so we can help you.

0
votes

Here is a Python version:

from collections import Counter

wc = Counter()

# read line by line so the multi-gigabyte file never has to be loaded at once;
# Counter.update() adds each whitespace-delimited token to the running tally
with open('tale.txt') as f:
    for line in f:
        wc.update(line.split())

print(wc.most_common(50))

Running that on Tale of Two Cities yields:

[('the', 7514), ('and', 4745), ('of', 4066), ('to', 3458), ('a', 2825), ('in', 2447), ('his', 1911), ('was', 1673), ('that', 1663), ('I', 1446), ('he', 1388), ('with', 1288), ('had', 1263), ('it', 1173), ('as', 1016), ('at', 978), ('you', 895), ('for', 868), ('on', 820), ('her', 818), ('not', 748), ('is', 713), ('have', 703), ('be', 701), ('were', 633), ('Mr.', 602), ('The', 587), ('said', 570), ('my', 568), ('by', 547), ('him', 525), ('from', 505), ('this', 465), ('all', 459), ('they', 446), ('no', 423), ('so', 420), ('or', 418), ('been', 415), ('"I', 400), ('but', 387), ('which', 375), ('He', 363), ('when', 354), ('an', 337), ('one', 334), ('out', 333), ('who', 331), ('if', 327), ('would', 327)]

You can also come up with a modular, Unix-style solution with awk, sort, and head:

$ awk '{for (i=1;i<=NF; i++){words[$i]++}}END{for (w in words) print words[w]"\t"w}' tale.txt | sort -n -r | head -n 50
7514    the
4745    and
4066    of
3458    to
2825    a
2447    in
...

No matter the language, the recipe is the same:

  1. Create an associative array of words and their frequency count
  2. Read the file line by line and add to the associative array word by word
  3. Sort the array by frequency and print the desired number of entries.

You also need to think about what a 'word' is. In this case, I have simply used whitespace as the delimiter, so each block of non-space characters counts as a 'word'. That means that And, and, and "And (with its leading quotation mark) are all treated as different words. Separating out punctuation is an additional step, usually involving a regular expression.
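
For example, one possible tweak to the one-liner above (a sketch, not the only way to tokenize) is to replace punctuation with spaces and fold case before counting:

$ awk '{gsub(/[[:punct:]]/, " "); $0 = tolower($0); for (i=1; i<=NF; i++) words[$i]++} END {for (w in words) print words[w] "\t" w}' tale.txt | sort -n -r | head -n 50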

0
votes

I created a file by copying this entire article. This awk one-liner could be a start.

awk -v RS="[[:punct:]]" '{for(i=1;i<=NF;i++) words[$i]++;}END{for (i in words) print words[i]" "i}' file

A piece of the output:

 1 exploration
 1 day
 1 staggering
 1 these
 2 into
 1 Africans
 4 across
 5 The
 1 head
 1 parasitic
 1 parasitized
 1 discovered
 1 To
 1 both
 1 what
 1 As
 1 inject
 1 hypodermic
 1 succumbing
 1 glass
 1 picked
 1 Observatory
 1 actually

The complete version. I use two files: one with the English stop words, and the file containing the text from which we want to extract the 50 most frequent words.

BEGIN {
    FS="[[:punct:] ]";
}
# first file (the stop-word list, one word per line): remember each word, skip to the next line
FNR==NR{
    stop_words[$1]++;
    next;
}
# second file: count every field that is not a stop word
{
    for(i=1;i<=NF;i++)
    {
        if (stop_words[$i])
        {
            continue;
        }

        if ($i ~ /[[:alpha:]]+/) # add only if the value is alphabetical
        {
            words[$i]++;
        }
    }
}
END {
    PROCINFO["sorted_in"] = "@val_num_desc" # iterate by count, highest first (GNU awk)
    for (w in words)
    {
        count++;
        print words[w], w;
        if (count == 50)
        {
            break;
        }
    }
}

How to run it:

awk -f script.awk english_stop_words.txt big_file.txt
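
The stop-word file is expected to be plain text with one stop word per line; a hypothetical excerpt, and a run that sends the result to a text file, might look like:

$ head -5 english_stop_words.txt
the
and
of
to
a
$ awk -f script.awk english_stop_words.txt big_file.txt > top50.txt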