0
votes

I'm not very Linux-savvy, and I have a very large text file (a couple of gigs). I would like to find the most frequent words (say the top 50) and a count of how many times each word occurs, and have these figures output to a text file, something like so:

2500 and

How can I do this using Awk? (It does not specifically have to be using Awk, but I'm using Cygwin on Windows 7 and I'm not sure what other things are available to do this sort of thing).

I have taken a look at this question: https://unix.stackexchange.com/questions/41479/find-n-most-frequent-words-in-a-file

As previously stated, though, I'm not too familiar with Linux, piping, etc., so I would appreciate it if someone could explain what each command does.

3
tr translates the complement (-c) of alphanumeric characters to newlines. sort brings like words together, then uniq -c produces one line for each different word with a count. sort -nr then sorts the counts numerically, largest to smallest, and head -10 gives the first 10 lines. In Unix variants, including Linux and Cygwin, the man command (for manual) gives reference for each command. Thus, man tr would give the manual page for tr, etc. - mpez0
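
Putting those pieces together, the pipeline from the linked question would look roughly like this (a sketch, not a tested command; bigfile.txt and top50.txt are placeholder names, and head -50 keeps the top 50 rather than the top 10):

# turn every run of non-alphanumeric characters into a newline, group identical
# words, count them, sort by count (largest first) and keep the top 50
tr -cs '[:alnum:]' '\n' < bigfile.txt | sort | uniq -c | sort -nr | head -50 > top50.txt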

3 Answers

1
votes

It all depends on your definition of a "word", but if we assume it's a contiguous sequence of alphabetic characters, compared case-insensitively, then one approach with GNU awk (which is the awk you get with Cygwin) would be:

awk -v RS='[[:alpha:]]+' '
    # RT holds the text that matched RS, i.e. the current word (GNU awk)
    RT { cnt[tolower(RT)]++ }
    END {
        # iterate over the counts by value, numeric, descending (GNU awk)
        PROCINFO["sorted_in"] = "@val_num_desc"
        for (word in cnt) {
            print cnt[word], word
            if (++c == 50) {
                exit
            }
        }
    }
' file

When run on @dawg's Tale of Two Cities example, the above outputs:

8230 the
5067 and
4140 of
3651 to
3017 a
2660 in
...
440 when
440 been
428 which
399 them
385 what

Want to exclude the 1- or 2-character filler words like of, to, a, and in from the output above? Easy:

awk -v RS='[[:alpha:]]+' '
    length(RT)>2 { cnt[tolower(RT)]++ }
    END {
        PROCINFO["sorted_in"] = "@val_num_desc"
        for (word in cnt) {
            print cnt[word], word
            if (++c == 50) {
                exit
            }
        }
    }
' pg98.txt

which outputs:
8230 the
5067 and
2011 his
1956 that
1774 was
1497 you
1358 with
....

With other awks, it'd be a while(match())/substr() loop with the output piped to sort -n and then head.
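
A rough, untested sketch of that portable variant (same definition of a word as above, should work in any POSIX awk) might look like:

awk '
    {
        line = tolower($0)
        # peel off one alphabetic run at a time; match() sets RSTART and RLENGTH
        while (match(line, /[[:alpha:]]+/)) {
            cnt[substr(line, RSTART, RLENGTH)]++
            line = substr(line, RSTART + RLENGTH)
        }
    }
    END {
        for (word in cnt) {
            print cnt[word], word
        }
    }
' file | sort -rn | head -n 50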

If that's not what you want then edit your question to include some sample input and expected output so we can help you.

0
votes

Here is a Python version:

from collections import Counter

wc = Counter()

# read line by line so the multi-gigabyte file never has to be loaded at once;
# Counter.update() adds each whitespace-delimited token to the running tally
with open('tale.txt') as f:
    for line in f:
        wc.update(line.split())

print(wc.most_common(50))

Running that on Tale of Two Cities yields:

[('the', 7514), ('and', 4745), ('of', 4066), ('to', 3458), ('a', 2825), ('in', 2447), ('his', 1911), ('was', 1673), ('that', 1663), ('I', 1446), ('he', 1388), ('with', 1288), ('had', 1263), ('it', 1173), ('as', 1016), ('at', 978), ('you', 895), ('for', 868), ('on', 820), ('her', 818), ('not', 748), ('is', 713), ('have', 703), ('be', 701), ('were', 633), ('Mr.', 602), ('The', 587), ('said', 570), ('my', 568), ('by', 547), ('him', 525), ('from', 505), ('this', 465), ('all', 459), ('they', 446), ('no', 423), ('so', 420), ('or', 418), ('been', 415), ('"I', 400), ('but', 387), ('which', 375), ('He', 363), ('when', 354), ('an', 337), ('one', 334), ('out', 333), ('who', 331), ('if', 327), ('would', 327)]

You can also come up with a modular, Unix-style solution with awk, sort, and head:

$ awk '{for (i=1;i<=NF; i++){words[$i]++}}END{for (w in words) print words[w]"\t"w}' tale.txt | sort -n -r | head -n 50
7514    the
4745    and
4066    of
3458    to
2825    a
2447    in
...

No matter the language, the recipe is the same:

  1. Create an associative array of words and their frequency count
  2. Read the file line by line and add to the associative array word by word
  3. Sort the array by frequency and print the desired number of entries.

You also need to think about what a 'word' is. In this case, I have simply used whitespace as the delimiter, so each block of non-space characters counts as a 'word'. That means that And, and, and "And (with its leading quotation mark) are all treated as different words. Separating out punctuation is an additional step, usually involving a regular expression.
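
For example, one possible tweak to the one-liner above (a sketch, not the only way to tokenize) is to replace punctuation with spaces and fold case before counting:

$ awk '{gsub(/[[:punct:]]/, " "); $0 = tolower($0); for (i=1; i<=NF; i++) words[$i]++} END {for (w in words) print words[w] "\t" w}' tale.txt | sort -n -r | head -n 50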

0
votes

I created a file by copying this entire article. This awk one-liner could be a start.

awk -v RS="[[:punct:]]" '{for(i=1;i<=NF;i++) words[$i]++;}END{for (i in words) print words[i]" "i}' file

A piece of the output:

 1 exploration
 1 day
 1 staggering
 1 these
 2 into
 1 Africans
 4 across
 5 The
 1 head
 1 parasitic
 1 parasitized
 1 discovered
 1 To
 1 both
 1 what
 1 As
 1 inject
 1 hypodermic
 1 succumbing
 1 glass
 1 picked
 1 Observatory
 1 actually

The complete version. I use two files: one with the English stop words, and the file containing the text from which we want to extract the 50 most frequent words.

BEGIN {
    FS="[[:punct:] ]";
}
# first file (the stop-word list, one word per line): remember each word, skip to the next line
FNR==NR{
    stop_words[$1]++;
    next;
}
# second file: count every field that is not a stop word
{
    for(i=1;i<=NF;i++)
    {
        if (stop_words[$i])
        {
            continue;
        }

        if ($i ~ /[[:alpha:]]+/) # add only if the value is alphabetical
        {
            words[$i]++;
        }
    }
}
END {
    PROCINFO["sorted_in"] = "@val_num_desc" # iterate by count, highest first (GNU awk)
    for (w in words)
    {
        count++;
        print words[w], w;
        if (count == 50)
        {
            break;
        }
    }
}

How to run it:

awk -f script.awk english_stop_words.txt big_file.txt
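
The stop-word file is expected to be plain text with one stop word per line; a hypothetical excerpt, and a run that sends the result to a text file, might look like:

$ head -5 english_stop_words.txt
the
and
of
to
a
$ awk -f script.awk english_stop_words.txt big_file.txt > top50.txt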