Build an ASCII chart of the most commonly used words in a given text [closed]

156

votes

The challenge:

Build an ASCII chart of the most commonly used words in a given text.

The rules:

Only accept a-z and A-Z (alphabetic characters) as part of a word.
Ignore casing (She == she for our purpose).
Ignore the following words (quite arbitary, I know): the, and, of, to, a, i, it, in, or, is
Clarification: considering don't: this would be taken as 2 different 'words' in the ranges a-z and A-Z: (don and t).
Optionally (it's too late to be formally changing the specifications now) you may choose to drop all single-letter 'words' (this could potentially make for a shortening of the ignore list too).

Parse a given text (read a file specified via command line arguments or piped in; presume us-ascii) and build us a word frequency chart with the following characteristics:

Display the chart (also see the example below) for the 22 most common words (ordered by descending frequency).
The bar width represents the number of occurences (frequency) of the word (proportionally). Append one space and print the word.
Make sure these bars (plus space-word-space) always fit: bar + [space] + word + [space] should be always <= 80 characters (make sure you account for possible differing bar and word lengths: e.g.: the second most common word could be a lot longer then the first while not differing so much in frequency). Maximize bar width within these constraints and scale the bars appropriately (according to the frequencies they represent).

An example:

The text for the example can be found here (Alice's Adventures in Wonderland, by Lewis Carroll).

This specific text would yield the following chart:

 _________________________________________________________________________
|_________________________________________________________________________| she 
|_______________________________________________________________| you 
|____________________________________________________________| said 
|____________________________________________________| alice 
|______________________________________________| was 
|__________________________________________| that 
|___________________________________| as 
|_______________________________| her 
|____________________________| with 
|____________________________| at 
|___________________________| s 
|___________________________| t 
|_________________________| on 
|_________________________| all 
|______________________| this 
|______________________| for 
|______________________| had 
|_____________________| but 
|____________________| be 
|____________________| not 
|___________________| they 
|__________________| so

For your information: these are the frequencies the above chart is built upon:

[('she', 553), ('you', 481), ('said', 462), ('alice', 403), ('was', 358), ('that
', 330), ('as', 274), ('her', 248), ('with', 227), ('at', 227), ('s', 219), ('t'
, 218), ('on', 204), ('all', 200), ('this', 181), ('for', 179), ('had', 178), ('
but', 175), ('be', 167), ('not', 166), ('they', 155), ('so', 152)]

A second example (to check if you implemented the complete spec): Replace every occurence of you in the linked Alice in Wonderland file with superlongstringstring:

 ________________________________________________________________
|________________________________________________________________| she 
|_______________________________________________________| superlongstringstring 
|_____________________________________________________| said 
|______________________________________________| alice 
|________________________________________| was 
|_____________________________________| that 
|______________________________| as 
|___________________________| her 
|_________________________| with 
|_________________________| at 
|________________________| s 
|________________________| t 
|______________________| on 
|_____________________| all 
|___________________| this 
|___________________| for 
|___________________| had 
|__________________| but 
|_________________| be 
|_________________| not 
|________________| they 
|________________| so

The winner:

Shortest solution (by character count, per language). Have fun!

Edit: Table summarizing the results so far (2012-02-15) (originally added by user Nas Banov):

Language          Relaxed  Strict
=========         =======  ======
GolfScript          130     143
Perl                        185
Windows PowerShell  148     199
Mathematica                 199
Ruby                185     205
Unix Toolchain      194     228
Python              183     243
Clojure                     282
Scala                       311
Haskell                     333
Awk                         336
R                   298
Javascript          304     354
Groovy              321
Matlab                      404
C#                          422
Smalltalk           386
PHP                 450
F#                          452
TSQL                483     507

The numbers represent the length of the shortest solution in a specific language. "Strict" refers to a solution that implements the spec completely (draws |____| bars, closes the first bar on top with a ____ line, accounts for the possibility of long words with high frequency etc). "Relaxed" means some liberties were taken to shorten to solution.

Only solutions shorter then 500 characters are included. The list of languages is sorted by the length of the 'strict' solution. 'Unix Toolchain' is used to signify various solutions that use traditional *nix shell plus a mix of tools (like grep, tr, sort, uniq, head, perl, awk).

language-agnosticcode-golf

Well, 'longest bar'+word=80 may not fit within 80 cols if second-most-common-word is a much longer word. Am looking for the 'max constraint' I guess. – Brian

Do we normalize casing? 'She' = 'she'? – Brian

IMO making this perform, both in terms of execution time and memory usage, seems like a more interesting challenge than character count. – Frank Farmer

I'm glad to see that my favorite words s and t are represented. – indiv

@indiv, @Nas Banov -- silly too-simple tokenizer reads "didn't" as {didn, t} and "she's" as {she, s} :) – hobbs

30 Answers

123

votes

LabVIEW 51 nodes, 5 structures, 10 diagrams

Teaching the elephant to tap-dance is never pretty. I'll, ah, skip the character count.

The program flows from left to right:

votes

Ruby 1.9, 185 chars

(heavily based on the other Ruby solutions)

w=($<.read.downcase.scan(/[a-z]+/)-%w{the and of to a i it in or is}).group_by{|x|x}.map{|x,y|[-y.size,x]}.sort[0,22]
k,l=w[0]
puts [?\s+?_*m=76-l.size,w.map{|f,x|?|+?_*(f*m/k)+"| "+x}]

Instead of using any command line switches like the other solutions, you can simply pass the filename as argument. (i.e. ruby1.9 wordfrequency.rb Alice.txt)

Since I'm using character-literals here, this solution only works in Ruby 1.9.

Edit: Replaced semicolons by line breaks for "readability". :P

Edit 2: Shtééf pointed out I forgot the trailing space - fixed that.

Edit 3: Removed the trailing space again ;)

votes

GolfScript, 177 175 173 167 164 163 144 131 130 chars

Slow - 3 minutes for the sample text (130)

{32|.123%97<n@if}%]''*n%"oftoitinorisa"2/-"theandi"3/-$(1@{.3$>1{;)}if}/]2/{~~\;}$22<.0=~:2;,76\-:1'_':0*' '\@{"
|"\~1*2/0*'| '@}/

Explanation:

{           #loop through all characters
 32|.       #convert to uppercase and duplicate
 123%97<    #determine if is a letter
 n@if       #return either the letter or a newline
}%          #return an array (of ints)
]''*        #convert array to a string with magic
n%          #split on newline, removing blanks (stack is an array of words now)
"oftoitinorisa"   #push this string
2/          #split into groups of two, i.e. ["of" "to" "it" "in" "or" "is" "a"]
-           #remove any occurrences from the text
"theandi"3/-#remove "the", "and", and "i"
$           #sort the array of words
(1@         #takes the first word in the array, pushes a 1, reorders stack
            #the 1 is the current number of occurrences of the first word
{           #loop through the array
 .3$>1{;)}if#increment the count or push the next word and a 1
}/
]2/         #gather stack into an array and split into groups of 2
{~~\;}$     #sort by the latter element - the count of occurrences of each word
22<         #take the first 22 elements
.0=~:2;     #store the highest count
,76\-:1     #store the length of the first line
'_':0*' '\@ #make the first line
{           #loop through each word
"
|"\~        #start drawing the bar
1*2/0       #divide by zero
*'| '@      #finish drawing the bar
}/

"Correct" (hopefully). (143)

{32|.123%97<n@if}%]''*n%"oftoitinorisa"2/-"theandi"3/-$(1@{.3$>1{;)}if}/]2/{~~\;}$22<..0=1=:^;{~76@,-^*\/}%$0=:1'_':0*' '\@{"
|"\~1*^/0*'| '@}/

Less slow - half a minute. (162)

'"'/' ':S*n/S*'"#{%q
'\+"
.downcase.tr('^a-z','
')}\""+~n%"oftoitinorisa"2/-"theandi"3/-$(1@{.3$>1{;)}if}/]2/{~~\;}$22<.0=~:2;,76\-:1'_':0*S\@{"
|"\~1*2/0*'| '@}/

Output visible in revision logs.

votes

206 shell, grep, tr, grep, sort, uniq, sort, head, perl

~ % wc -c wfg
209 wfg
~ % cat wfg
egrep -oi \\b[a-z]+|tr A-Z a-z|egrep -wv 'the|and|of|to|a|i|it|in|or|is'|sort|uniq -c|sort -nr|head -22|perl -lape'($f,$w)=@F;$.>1or($q,$x)=($f,76-length$w);$b="_"x($f/$q*$x);$_="|$b| $w ";$.>1or$_=" $b\n$_"'
~ % # usage:
~ % sh wfg < 11.txt

~~hm, just seen above: sort -nr -> sort -n and then head -> tail => 208 :)~~
update2: erm, of course the above is silly, as it will be reversed then. So, 209.
update3: optimized the exclusion regexp -> 206

egrep -oi \\b[a-z]+|tr A-Z a-z|egrep -wv 'the|and|o[fr]|to|a|i[tns]?'|sort|uniq -c|sort -nr|head -22|perl -lape'($f,$w)=@F;$.>1or($q,$x)=($f,76-length$w);$b="_"x($f/$q*$x);$_="|$b| $w ";$.>1or$_=" $b\n$_"'

for fun, here's a perl-only version (much faster):

~ % wc -c pgolf
204 pgolf
~ % cat pgolf
perl -lne'$1=~/^(the|and|o[fr]|to|.|i[tns])$/i||$f{lc$1}++while/\b([a-z]+)/gi}{@w=(sort{$f{$b}<=>$f{$a}}keys%f)[0..21];$Q=$f{$_=$w[0]};$B=76-y///c;print" "."_"x$B;print"|"."_"x($B*$f{$_}/$Q)."| $_"for@w'
~ % # usage:
~ % sh pgolf < 11.txt

votes

Transact SQL set based solution (SQL Server 2005) 1063 892 873 853 827 820 783 683 647 644 630 characters

Thanks to Gabe for some useful suggestions to reduce the character count.

NB: Line breaks added to avoid scrollbars only the last line break is required.

DECLARE @ VARCHAR(MAX),@F REAL SELECT @=BulkColumn FROM OPENROWSET(BULK'A',
SINGLE_BLOB)x;WITH N AS(SELECT 1 i,LEFT(@,1)L UNION ALL SELECT i+1,SUBSTRING
(@,i+1,1)FROM N WHERE i<LEN(@))SELECT i,L,i-RANK()OVER(ORDER BY i)R INTO #D
FROM N WHERE L LIKE'[A-Z]'OPTION(MAXRECURSION 0)SELECT TOP 22 W,-COUNT(*)C
INTO # FROM(SELECT DISTINCT R,(SELECT''+L FROM #D WHERE R=b.R FOR XML PATH
(''))W FROM #D b)t WHERE LEN(W)>1 AND W NOT IN('the','and','of','to','it',
'in','or','is')GROUP BY W ORDER BY C SELECT @F=MIN(($76-LEN(W))/-C),@=' '+
REPLICATE('_',-MIN(C)*@F)+' 'FROM # SELECT @=@+' 
|'+REPLICATE('_',-C*@F)+'| '+W FROM # ORDER BY C PRINT @

Readable Version

DECLARE @  VARCHAR(MAX),
        @F REAL
SELECT @=BulkColumn
FROM   OPENROWSET(BULK'A',SINGLE_BLOB)x; /*  Loads text file from path
                                             C:\WINDOWS\system32\A  */

/*Recursive common table expression to
generate a table of numbers from 1 to string length
(and associated characters)*/
WITH N AS
     (SELECT 1 i,
             LEFT(@,1)L

     UNION ALL

     SELECT i+1,
            SUBSTRING(@,i+1,1)
     FROM   N
     WHERE  i<LEN(@)
     )
  SELECT   i,
           L,
           i-RANK()OVER(ORDER BY i)R
           /*Will group characters
           from the same word together*/
  INTO     #D
  FROM     N
  WHERE    L LIKE'[A-Z]'OPTION(MAXRECURSION 0)
             /*Assuming case insensitive accent sensitive collation*/

SELECT   TOP 22 W,
         -COUNT(*)C
INTO     #
FROM     (SELECT DISTINCT R,
                          (SELECT ''+L
                          FROM    #D
                          WHERE   R=b.R FOR XML PATH('')
                          )W
                          /*Reconstitute the word from the characters*/
         FROM             #D b
         )
         T
WHERE    LEN(W)>1
AND      W NOT IN('the',
                  'and',
                  'of' ,
                  'to' ,
                  'it' ,
                  'in' ,
                  'or' ,
                  'is')
GROUP BY W
ORDER BY C

/*Just noticed this looks risky as it relies on the order of evaluation of the 
 variables. I'm not sure that's guaranteed but it works on my machine :-) */
SELECT @F=MIN(($76-LEN(W))/-C),
       @ =' '      +REPLICATE('_',-MIN(C)*@F)+' '
FROM   #

SELECT @=@+' 
|'+REPLICATE('_',-C*@F)+'| '+W
             FROM     #
             ORDER BY C

PRINT @

Output

 _________________________________________________________________________ 
|_________________________________________________________________________| she
|_______________________________________________________________| You
|____________________________________________________________| said
|_____________________________________________________| Alice
|_______________________________________________| was
|___________________________________________| that
|____________________________________| as
|________________________________| her
|_____________________________| at
|_____________________________| with
|__________________________| on
|__________________________| all
|_______________________| This
|_______________________| for
|_______________________| had
|_______________________| but
|______________________| be
|_____________________| not
|____________________| they
|____________________| So
|___________________| very
|__________________| what

And with the long string

 _______________________________________________________________ 
|_______________________________________________________________| she
|_______________________________________________________| superlongstringstring
|____________________________________________________| said
|______________________________________________| Alice
|________________________________________| was
|_____________________________________| that
|_______________________________| as
|____________________________| her
|_________________________| at
|_________________________| with
|_______________________| on
|______________________| all
|____________________| This
|____________________| for
|____________________| had
|____________________| but
|___________________| be
|__________________| not
|_________________| they
|_________________| So
|________________| very
|________________| what

votes

Ruby 207 213 211 210 207 203 201 200 chars

An improvement on Anurag, incorporating suggestion from rfusca. Also removes argument to sort and a few other minor golfings.

w=(STDIN.read.downcase.scan(/[a-z]+/)-%w{the and of to a i it in or is}).group_by{|x|x}.map{|x,y|[-y.size,x]}.sort.take 22;k,l=w[0];m=76.0-l.size;puts' '+'_'*m;w.map{|f,x|puts"|#{'_'*(m*f/k)}| #{x} "}

Execute as:

ruby GolfedWordFrequencies.rb < Alice.txt

Edit: put 'puts' back in, needs to be there to avoid having quotes in output.
Edit2: Changed File->IO
Edit3: removed /i
Edit4: Removed parentheses around (f*1.0), recounted
Edit5: Use string addition for the first line; expand s in-place.
Edit6: Made m float, removed 1.0. EDIT: Doesn't work, changes lengths. EDIT: No worse than before
Edit7: Use STDIN.read.

votes

Mathematica (297 284 248 244 242 199 chars) Pure Functional

and Zipf's Law Testing

Look Mamma ... no vars, no hands, .. no head

Edit 1> some shorthands defined (284 chars)

f[x_, y_] := Flatten[Take[x, All, y]]; 

BarChart[f[{##}, -1], 
         BarOrigin -> Left, 
         ChartLabels -> Placed[f[{##}, 1], After], 
         Axes -> None
] 
& @@
Take[
  SortBy[
     Tally[
       Select[
        StringSplit[ToLowerCase[Import[i]], RegularExpression["\\W+"]], 
       !MemberQ[{"the", "and", "of", "to", "a", "i", "it", "in", "or","is"}, #]&]
     ], 
  Last], 
-22]

Some explanations

Import[] 
   # Get The File

ToLowerCase []
   # To Lower Case :)

StringSplit[ STRING , RegularExpression["\\W+"]]
   # Split By Words, getting a LIST

Select[ LIST, !MemberQ[{LIST_TO_AVOID}, #]&]
   #  Select from LIST except those words in LIST_TO_AVOID
   #  Note that !MemberQ[{LIST_TO_AVOID}, #]& is a FUNCTION for the test

Tally[LIST]
   # Get the LIST {word,word,..} 
     and produce another  {{word,counter},{word,counter}...}

SortBy[ LIST ,Last]
   # Get the list produced bt tally and sort by counters
     Note that counters are the LAST element of {word,counter}

Take[ LIST ,-22]
   # Once sorted, get the biggest 22 counters

BarChart[f[{##}, -1], ChartLabels -> Placed[f[{##}, 1], After]] &@@ LIST
   # Get the list produced by Take as input and produce a bar chart

f[x_, y_] := Flatten[Take[x, All, y]]
   # Auxiliary to get the list of the first or second element of lists of lists x_
     dependending upon y
   # So f[{##}, -1] is the list of counters
   # and f[{##}, 1] is the list of words (labels for the chart)

Output

alt text http://i49.tinypic.com/2n8mrer.jpg

Mathematica is not well suited for golfing, and that is just because of the long, descriptive function names. Functions like "RegularExpression[]" or "StringSplit[]" just make me sob :(.

Zipf's Law Testing

The Zipf's law predicts that for a natural language text, the Log (Rank) vs Log (occurrences) Plot follows a linear relationship.

The law is used in developing algorithms for criptography and data compression. (But it's NOT the "Z" in the LZW algorithm).

In our text, we can test it with the following

 f[x_, y_] := Flatten[Take[x, All, y]]; 
 ListLogLogPlot[
     Reverse[f[{##}, -1]], 
     AxesLabel -> {"Log (Rank)", "Log Counter"}, 
     PlotLabel -> "Testing Zipf's Law"]
 & @@
 Take[
  SortBy[
    Tally[
       StringSplit[ToLowerCase[b], RegularExpression["\\W+"]]
    ], 
   Last],
 -1000]

The result is (pretty well linear)

alt text http://i46.tinypic.com/33fcmdk.jpg

Edit 6 > (242 Chars)

Refactoring the Regex (no Select function anymore)
Dropping 1 char words
More efficient definition for function "f"

f = Flatten[Take[#1, All, #2]]&; 
BarChart[
     f[{##}, -1], 
     BarOrigin -> Left, 
     ChartLabels -> Placed[f[{##}, 1], After], 
     Axes -> None] 
& @@
  Take[
    SortBy[
       Tally[
         StringSplit[ToLowerCase[Import[i]], 
          RegularExpression["(\\W|\\b(.|the|and|of|to|i[tns]|or)\\b)+"]]
       ],
    Last],
  -22]

Edit 7 → 199 characters

BarChart[#2, BarOrigin->Left, ChartLabels->Placed[#1, After], Axes->None]&@@ 
  Transpose@Take[SortBy[Tally@StringSplit[ToLowerCase@Import@i, 
    RegularExpression@"(\\W|\\b(.|the|and|of|to|i[tns]|or)\\b)+"],Last], -22]

Replaced f with Transpose and Slot (#1/#2) arguments.
We don't need no stinkin' brackets (use f@x instead of f[x] where possible)

votes

C# - 510 451 436 446 434 426 422 chars (minified)

Not that short, but now probably correct! Note, the previous version did not show the first line of the bars, did not scale the bars correctly, downloaded the file instead of getting it from stdin, and did not include all the required C# verbosity. You could easily shave many strokes if C# didn't need so much extra crap. Maybe Powershell could do better.

using C=System.Console;   // alias for Console
using System.Linq;  // for Split, GroupBy, Select, OrderBy, etc.

class Class // must define a class
{
    static void Main()  // must define a Main
    {
        // split into words
        var allwords = System.Text.RegularExpressions.Regex.Split(
                // convert stdin to lowercase
                C.In.ReadToEnd().ToLower(),
                // eliminate stopwords and non-letters
                @"(?:\b(?:the|and|of|to|a|i[tns]?|or)\b|\W)+")
            .GroupBy(x => x)    // group by words
            .OrderBy(x => -x.Count()) // sort descending by count
            .Take(22);   // take first 22 words

        // compute length of longest bar + word
        var lendivisor = allwords.Max(y => y.Count() / (76.0 - y.Key.Length));

        // prepare text to print
        var toPrint = allwords.Select(x=> 
            new { 
                // remember bar pseudographics (will be used in two places)
                Bar = new string('_',(int)(x.Count()/lendivisor)), 
                Word=x.Key 
            })
            .ToList();  // convert to list so we can index into it

        // print top of first bar
        C.WriteLine(" " + toPrint[0].Bar);
        toPrint.ForEach(x =>  // for each word, print its bar and the word
            C.WriteLine("|" + x.Bar + "| " + x.Word));
    }
}

422 chars with lendivisor inlined (which makes it 22 times slower) in the below form (newlines used for select spaces):

using System.Linq;using C=System.Console;class M{static void Main(){var
a=System.Text.RegularExpressions.Regex.Split(C.In.ReadToEnd().ToLower(),@"(?:\b(?:the|and|of|to|a|i[tns]?|or)\b|\W)+").GroupBy(x=>x).OrderBy(x=>-x.Count()).Take(22);var
b=a.Select(x=>new{p=new string('_',(int)(x.Count()/a.Max(y=>y.Count()/(76d-y.Key.Length)))),t=x.Key}).ToList();C.WriteLine(" "+b[0].p);b.ForEach(x=>C.WriteLine("|"+x.p+"| "+x.t));}}

votes

Perl, 237 229 209 chars

(Updated again to beat the Ruby version with more dirty golf tricks, replacing split/[^a-z/,lc with lc=~/[a-z]+/g, and eliminating a check for empty string in another place. These were inspired by the Ruby version, so credit where credit is due.)

Update: now with Perl 5.10! Replace print with say, and use ~~ to avoid a map. This has to be invoked on the command line as perl -E '<one-liner>' alice.txt. Since the entire script is on one line, writing it as a one-liner shouldn't present any difficulty :).

 @s=qw/the and of to a i it in or is/;$c{$_}++foreach grep{!($_~~@s)}map{lc=~/[a-z]+/g}<>;@s=sort{$c{$b}<=>$c{$a}}keys%c;$f=76-length$s[0];say" "."_"x$f;say"|"."_"x($c{$_}/$c{$s[0]}*$f)."| $_ "foreach@s[0..21];

Note that this version normalizes for case. This doesn't shorten the solution any, since removing ,lc (for lower-casing) requires you to add A-Z to the split regex, so it's a wash.

If you're on a system where a newline is one character and not two, you can shorten this by another two chars by using a literal newline in place of \n. However, I haven't written the above sample that way, since it's "clearer" (ha!) that way.

Here is a mostly correct, but not remotely short enough, perl solution:

use strict;
use warnings;

my %short = map { $_ => 1 } qw/the and of to a i it in or is/;
my %count = ();

$count{$_}++ foreach grep { $_ && !$short{$_} } map { split /[^a-zA-Z]/ } (<>);
my @sorted = (sort { $count{$b} <=> $count{$a} } keys %count)[0..21];
my $widest = 76 - (length $sorted[0]);

print " " . ("_" x $widest) . "\n";
foreach (@sorted)
{
    my $width = int(($count{$_} / $count{$sorted[0]}) * $widest);
    print "|" . ("_" x $width) . "| $_ \n";
}

The following is about as short as it can get while remaining relatively readable. (392 chars).

%short = map { $_ => 1 } qw/the and of to a i it in or is/;
%count;

$count{$_}++ foreach grep { $_ && !$short{$_} } map { split /[^a-z]/, lc } (<>);
@sorted = (sort { $count{$b} <=> $count{$a} } keys %count)[0..21];
$widest = 76 - (length $sorted[0]);

print " " . "_" x $widest . "\n";
print"|" . "_" x int(($count{$_} / $count{$sorted[0]}) * $widest) . "| $_ \n" foreach @sorted;

votes

Windows PowerShell, 199 chars

$x=$input-split'\P{L}'-notmatch'^(the|and|of|to|.?|i[tns]|or)$'|group|sort *
filter f($w){' '+'_'*$w
$x[-1..-22]|%{"|$('_'*($w*$_.Count/$x[-1].Count))| "+$_.Name}}
f(76..1|?{!((f $_)-match'.'*80)})[0]

(The last line break isn't necessary, but included here for readability.)

(Current code and my test files available in my SVN repository. I hope my test cases catch most common errors (bar length, problems with regex matching and a few others))

Assumptions:

US ASCII as input. It probably gets weird with Unicode.
At least two non-stop words in the text

History

Relaxed version (137), since that's counted separately by now, apparently:

($x=$input-split'\P{L}'-notmatch'^(the|and|of|to|.?|i[tns]|or)$'|group|sort *)[-1..-22]|%{"|$('_'*(76*$_.Count/$x[-1].Count))| "+$_.Name}

doesn't close the first bar
doesn't account for word length of non-first word

Variations of the bar lengths of one character compared to other solutions is due to PowerShell using rounding instead of truncation when converting floating-point numbers into integers. Since the task required only proportional bar length this should be fine, though.

Compared to other solutions I took a slightly different approach in determining the longest bar length by simply trying out and taking the highest such length where no line is longer than 80 characters.

An older version explained can be found here.

votes

Ruby, 215, 216, 218, 221, 224, 236, 237 chars

update 1: Hurray! It's a tie with JS Bangs' solution. Can't think of a way to cut down any more :)

update 2: Played a dirty golf trick. Changed each to map to save 1 character :)

update 3: Changed File.read to IO.read +2. Array.group_by wasn't very fruitful, changed to reduce +6. Case insensitive check is not needed after lower casing with downcase in regex +1. Sorting in descending order is easily done by negating the value +6. Total savings +15

update 4: [0] rather than .first, +3. (@Shtééf)

update 5: Expand variable l in-place, +1. Expand variable s in-place, +2. (@Shtééf)

update 6: Use string addition rather than interpolation for the first line, +2. (@Shtééf)

w=(IO.read($_).downcase.scan(/[a-z]+/)-%w{the and of to a i it in or is}).reduce(Hash.new 0){|m,o|m[o]+=1;m}.sort_by{|k,v|-v}.take 22;m=76-w[0][0].size;puts' '+'_'*m;w.map{|x,f|puts"|#{'_'*(f*1.0/w[0][1]*m)}| #{x} "}

update 7: I went through a whole lot of hoopla to detect the first iteration inside the loop, using instance variables. All I got is +1, though perhaps there is potential. Preserving the previous version, because I believe this one is black magic. (@Shtééf)

(IO.read($_).downcase.scan(/[a-z]+/)-%w{the and of to a i it in or is}).reduce(Hash.new 0){|m,o|m[o]+=1;m}.sort_by{|k,v|-v}.take(22).map{|x,f|@f||(@f=f;puts' '+'_'*(@m=76-x.size));puts"|#{'_'*(f*1.0/@f*@m)}| #{x} "}

Readable version

string = File.read($_).downcase

words = string.scan(/[a-z]+/i)
allowed_words = words - %w{the and of to a i it in or is}
sorted_words = allowed_words.group_by{ |x| x }.map{ |x,y| [x, y.size] }.sort{ |a,b| b[1] <=> a[1] }.take(22)
highest_frequency = sorted_words.first
highest_frequency_count = highest_frequency[1]
highest_frequency_word = highest_frequency[0]

word_length = highest_frequency_word.size
widest = 76 - word_length

puts " #{'_' * widest}"    
sorted_words.each do |word, freq|
  width = (freq * 1.0 / highest_frequency_count) * widest
  puts "|#{'_' * width}| #{word} "
end

To use:

echo "Alice.txt" | ruby -ln GolfedWordFrequencies.rb

Output:

 _________________________________________________________________________
|_________________________________________________________________________| she 
|_______________________________________________________________| you 
|____________________________________________________________| said 
|_____________________________________________________| alice 
|_______________________________________________| was 
|___________________________________________| that 
|____________________________________| as 
|________________________________| her 
|_____________________________| with 
|_____________________________| at 
|____________________________| s 
|____________________________| t 
|__________________________| on 
|__________________________| all 
|_______________________| this 
|_______________________| for 
|_______________________| had 
|_______________________| but 
|______________________| be 
|_____________________| not 
|____________________| they 
|____________________| so

votes

Python 2.x, latitudinarian approach = 227 183 chars

import sys,re
t=re.split('\W+',sys.stdin.read().lower())
r=sorted((-t.count(w),w)for w in set(t)if w not in'andithetoforinis')[:22]
for l,w in r:print(78-len(r[0][1]))*l/r[0][0]*'=',w

Allowing for freedom in the implementation, I constructed a string concatenation that contains all the words requested for exclusion (the, and, of, to, a, i, it, in, or, is) - plus it also excludes the two infamous "words" s and t from the example - and I threw in for free the exclusion for an, for, he. I tried all concatenations of those words against corpus of the words from Alice, King James' Bible and the Jargon file to see if there are any words that will be mis-excluded by the string. And that is how I ended with two exclusion strings:itheandtoforinis and andithetoforinis.

PS. borrowed from other solutions to shorten the code.

=========================================================================== she 
================================================================= you
============================================================== said
====================================================== alice
================================================ was
============================================ that
===================================== as
================================= her
============================== at
============================== with
=========================== on
=========================== all
======================== this
======================== had
======================= but
====================== be
====================== not
===================== they
==================== so
=================== very
=================== what
================= little

Rant

Regarding words to ignore, one would think those would be taken from list of the most used words in English. That list depends on the text corpus used. Per one of the most popular lists (http://en.wikipedia.org/wiki/Most_common_words_in_English, http://www.english-for-students.com/Frequently-Used-Words.html, http://www.sporcle.com/games/common_english_words.php), top 10 words are: the be(am/are/is/was/were) to of and a in that have I

The top 10 words from the Alice in Wonderland text are the and to a of it she i you said
The top 10 words from the Jargon File (v4.4.7) are the a of to and in is that or for

So question is why or was included in the problem's ignore list, where it's ~30th in popularity when the word that (8th most used) is not. etc, etc. Hence I believe the ignore list should be provided dynamically (or could be omitted).

Alternative idea would be simply to skip the top 10 words from the result - which actually would shorten the solution (elementary - have to show only the 11th to 32nd entries).

Python 2.x, punctilious approach = 277 243 chars

The chart drawn in the above code is simplified (using only one character for the bars). If one wants to reproduce exactly the chart from the problem description (which was not required), this code will do it:

import sys,re
t=re.split('\W+',sys.stdin.read().lower())
r=sorted((-t.count(w),w)for w in set(t)-set(sys.argv))[:22]
h=min(9*l/(77-len(w))for l,w in r)
print'',9*r[0][0]/h*'_'
for l,w in r:print'|'+9*l/h*'_'+'|',w

I take an issue with the somewhat random choice of the 10 words to exclude the, and, of, to, a, i, it, in, or, is so those are to be passed as command line parameters, like so:
python WordFrequencyChart.py the and of to a i it in or is <"Alice's Adventures in Wonderland.txt"

This is 213 chars + 30 if we account for the "original" ignore list passed on command line = 243

PS. The second code also does "adjustment" for the lengths of all top words, so none of them will overflow in degenerate case.

 _______________________________________________________________
|_______________________________________________________________| she
|_______________________________________________________| superlongstringstring
|_____________________________________________________| said
|______________________________________________| alice
|_________________________________________| was
|______________________________________| that
|_______________________________| as
|____________________________| her
|__________________________| at
|__________________________| with
|_________________________| s
|_________________________| t
|_______________________| on
|_______________________| all
|____________________| this
|____________________| for
|____________________| had
|____________________| but
|___________________| be
|___________________| not
|_________________| they
|_________________| so

votes

Haskell - 366 351 344 337 333 characters

(One line break in main added for readability, and no line break needed at end of last line.)

import Data.List
import Data.Char
l=length
t=filter
m=map
f c|isAlpha c=toLower c|0<1=' '
h w=(-l w,head w)
x!(q,w)='|':replicate(minimum$m(q?)x)'_'++"| "++w
q?(g,w)=q*(77-l w)`div`g
b x=m(x!)x
a(l:r)=(' ':t(=='_')l):l:r
main=interact$unlines.a.b.take 22.sort.m h.group.sort
  .t(`notElem`words"the and of to a i it in or is").words.m f

How it works is best seen by reading the argument to interact backwards:

map f lowercases alphabetics, replaces everything else with spaces.
words produces a list of words, dropping the separating whitespace.
filter (notElemwords "the and of to a i it in or is") discards all entries with forbidden words.
group . sort sorts the words, and groups identical ones into lists.
map h maps each list of identical words to a tuple of the form (-frequency, word).
take 22 . sort sorts the tuples by descending frequency (the first tuple entry), and keeps only the first 22 tuples.
b maps tuples to bars (see below).
a prepends the first line of underscores, to complete the topmost bar.
unlines joins all these lines together with newlines.

The tricky bit is getting the bar length right. I assumed that only underscores counted towards the length of the bar, so || would be a bar of zero length. The function b maps c x over x, where x is the list of histograms. The entire list is passed to c, so that each invocation of c can compute the scale factor for itself by calling u. In this way, I avoid using floating-point math or rationals, whose conversion functions and imports would eat many characters.

Note the trick of using -frequency. This removes the need to reverse the sort since sorting (ascending) -frequency will places the words with the largest frequency first. Later, in the function u, two -frequency values are multiplied, which will cancel the negation out.

votes

JavaScript 1.8 (SpiderMonkey) - 354

x={};p='|';e=' ';z=[];c=77
while(l=readline())l.toLowerCase().replace(/\b(?!(the|and|of|to|a|i[tns]?|or)\b)\w+/g,function(y)x[y]?x[y].c++:z.push(x[y]={w:y,c:1}))
z=z.sort(function(a,b)b.c-a.c).slice(0,22)
for each(v in z){v.r=v.c/z[0].c
c=c>(l=(77-v.w.length)/v.r)?l:c}for(k in z){v=z[k]
s=Array(v.r*c|0).join('_')
if(!+k)print(e+s+e)
print(p+s+p+e+v.w)}

Sadly, the for([k,v]in z) from the Rhino version doesn't seem to want to work in SpiderMonkey, and readFile() is a little easier than using readline() but moving up to 1.8 allows us to use function closures to cut a few more lines....

Adding whitespace for readability:

x={};p='|';e=' ';z=[];c=77
while(l=readline())
  l.toLowerCase().replace(/\b(?!(the|and|of|to|a|i[tns]?|or)\b)\w+/g,
   function(y) x[y] ? x[y].c++ : z.push( x[y] = {w: y, c: 1} )
  )
z=z.sort(function(a,b) b.c - a.c).slice(0,22)
for each(v in z){
  v.r=v.c/z[0].c
  c=c>(l=(77-v.w.length)/v.r)?l:c
}
for(k in z){
  v=z[k]
  s=Array(v.r*c|0).join('_')
  if(!+k)print(e+s+e)
  print(p+s+p+e+v.w)
}

Usage: js golf.js < input.txt

Output:

 _________________________________________________________________________ 
|_________________________________________________________________________| she
|_______________________________________________________________| you
|____________________________________________________________| said
|____________________________________________________| alice
|______________________________________________| was
|___________________________________________| that
|___________________________________| as
|________________________________| her
|_____________________________| at
|_____________________________| with
|____________________________| s
|____________________________| t
|__________________________| on
|_________________________| all
|_______________________| this
|______________________| for
|______________________| had
|______________________| but
|_____________________| be
|_____________________| not
|___________________| they
|___________________| so

(base version - doesn't handle bar widths correctly)

JavaScript (Rhino) - 405 395 387 377 368 343 304 chars

~~I think my sorting logic is off, but.. I duno.~~ Brainfart fixed.

Minified (abusing \n's interpreted as a ; sometimes):

x={};p='|';e=' ';z=[]
readFile(arguments[0]).toLowerCase().replace(/\b(?!(the|and|of|to|a|i[tns]?|or)\b)\w+/g,function(y){x[y]?x[y].c++:z.push(x[y]={w:y,c:1})})
z=z.sort(function(a,b){return b.c-a.c}).slice(0,22)
for([k,v]in z){s=Array((v.c/z[0].c)*70|0).join('_')
if(!+k)print(e+s+e)
print(p+s+p+e+v.w)}

votes

PHP CLI version (450 chars)

This solution takes into account the last requirement which most purists have conviniently chosen to ignore. That costed 170 characters!

Usage: php.exe <this.php> <file.txt>

Minified:

<?php $a=array_count_values(array_filter(preg_split('/[^a-z]/',strtolower(file_get_contents($argv[1])),-1,1),function($x){return !preg_match("/^(.|the|and|of|to|it|in|or|is)$/",$x);}));arsort($a);$a=array_slice($a,0,22);function R($a,$F,$B){$r=array();foreach($a as$x=>$f){$l=strlen($x);$r[$x]=$b=$f*$B/$F;if($l+$b>76)return R($a,$f,76-$l);}return$r;}$c=R($a,max($a),76-strlen(key($a)));foreach($a as$x=>$f)echo '|',str_repeat('-',$c[$x]),"| $x\n";?>

Human readable:

<?php

// Read:
$s = strtolower(file_get_contents($argv[1]));

// Split:
$a = preg_split('/[^a-z]/', $s, -1, PREG_SPLIT_NO_EMPTY);

// Remove unwanted words:
$a = array_filter($a, function($x){
       return !preg_match("/^(.|the|and|of|to|it|in|or|is)$/",$x);
     });

// Count:
$a = array_count_values($a);

// Sort:
arsort($a);

// Pick top 22:
$a=array_slice($a,0,22);


// Recursive function to adjust bar widths
// according to the last requirement:
function R($a,$F,$B){
    $r = array();
    foreach($a as $x=>$f){
        $l = strlen($x);
        $r[$x] = $b = $f * $B / $F;
        if ( $l + $b > 76 )
            return R($a,$f,76-$l);
    }
    return $r;
}

// Apply the function:
$c = R($a,max($a),76-strlen(key($a)));


// Output:
foreach ($a as $x => $f)
    echo '|',str_repeat('-',$c[$x]),"| $x\n";

?>

Output:

|-------------------------------------------------------------------------| she
|---------------------------------------------------------------| you
|------------------------------------------------------------| said
|-----------------------------------------------------| alice
|-----------------------------------------------| was
|-------------------------------------------| that
|------------------------------------| as
|--------------------------------| her
|-----------------------------| at
|-----------------------------| with
|--------------------------| on
|--------------------------| all
|-----------------------| this
|-----------------------| for
|-----------------------| had
|-----------------------| but
|----------------------| be
|---------------------| not
|--------------------| they
|--------------------| so
|-------------------| very
|------------------| what

When there is a long word, the bars are adjusted properly:

|--------------------------------------------------------| she
|---------------------------------------------------| thisisareallylongwordhere
|-------------------------------------------------| you
|-----------------------------------------------| said
|-----------------------------------------| alice
|------------------------------------| was
|---------------------------------| that
|---------------------------| as
|-------------------------| her
|-----------------------| with
|-----------------------| at
|--------------------| on
|--------------------| all
|------------------| this
|------------------| for
|------------------| had
|-----------------| but
|-----------------| be
|----------------| not
|---------------| they
|---------------| so
|--------------| very

votes

Python 3.1 - 245 229 charaters

I guess using Counter is kind of cheating :) I just read about it about a week ago, so this was the perfect chance to see how it works.

import re,collections
o=collections.Counter([w for w in re.findall("[a-z]+",open("!").read().lower())if w not in"a and i in is it of or the to".split()]).most_common(22)
print('\n'.join('|'+76*v//o[0][1]*'_'+'| '+k for k,v in o))

Prints out:

|____________________________________________________________________________| she
|__________________________________________________________________| you
|_______________________________________________________________| said
|_______________________________________________________| alice
|_________________________________________________| was
|_____________________________________________| that
|_____________________________________| as
|__________________________________| her
|_______________________________| with
|_______________________________| at
|______________________________| s
|_____________________________| t
|____________________________| on
|___________________________| all
|________________________| this
|________________________| for
|________________________| had
|________________________| but
|______________________| be
|______________________| not
|_____________________| they
|____________________| so

Some of the code was "borrowed" from AKX's solution.

votes

perl, 205 191 189 characters/ 205 characters (fully implemented)

Some parts were inspired by the earlier perl/ruby submissions, a couple similar ideas were arrived at independently, the others are original. Shorter version also incorporates some things I saw/learned from other submissions.

Original:

$k{$_}++for grep{$_!~/^(the|and|of|to|a|i|it|in|or|is)$/}map{lc=~/[a-z]+/g}<>;@t=sort{$k{$b}<=>$k{$a}}keys%k;$l=76-length$t[0];printf" %s
",'_'x$l;printf"|%s| $_
",'_'x int$k{$_}/$k{$t[0]}*$l for@t[0..21];

~~Latest version down to~~ 191 characters:

/^(the|and|of|to|.|i[tns]|or)$/||$k{$_}++for map{lc=~/[a-z]+/g}<>;@e=sort{$k{$b}<=>$k{$a}}keys%k;$n=" %s
";$r=(76-y///c)/$k{$_=$e[0]};map{printf$n,'_'x($k{$_}*$r),$_;$n="|%s| %s
"}@e[0,0..21]

Latest version down to 189 characters:

/^(the|and|of|to|.|i[tns]|or)$/||$k{$_}++for map{lc=~/[a-z]+/g}<>;@_=sort{$k{$b}<=>$k{$a}}keys%k;$n=" %s
";$r=(76-m//)/$k{$_=$_[0]};map{printf$n,'_'x($k{$_}*$r),$_;$n="|%s| %s
"}@_[0,0..21]

This version (205 char) accounts for the lines with words longer than what would be found later.

/^(the|and|of|to|.|i[tns]|or)$/||$k{$_}++for map{lc=~/[a-z]+/g}<>;($r)=sort{$a<=>$b}map{(76-y///c)/$k{$_}}@e=sort{$k{$b}<=>$k{$a}}keys%k;$n=" %s
";map{printf$n,'_'x($k{$_}*$r),$_;$n="|%s| %s
";}@e[0,0..21]

votes

Perl: 203 202 201 198 195 208 203 / 231 chars

$/=\0;/^(the|and|of|to|.|i[tns]|or)$/i||$x{lc$_}++for<>=~/[a-z]+/gi;map{$z=$x{$_};$y||{$y=(76-y///c)/$z}&&warn" "."_"x($z*$y)."\n";printf"|%.78s\n","_"x($z*$y)."| $_"}(sort{$x{$b}<=>$x{$a}}keys%x)[0..21]

Alternate, full implementation including indicated behaviour (global bar-squishing) for the pathological case in which the secondary word is both popular and long enough to combine to over 80 chars (this implementation is 231 chars):

$/=\0;/^(the|and|of|to|.|i[tns]|or)$/i||$x{lc$_}++for<>=~/[a-z]+/gi;@e=(sort{$x{$b}<=>$x{$a}}keys%x)[0..21];for(@e){$p=(76-y///c)/$x{$_};($y&&$p>$y)||($y=$p)}warn" "."_"x($x{$e[0]}*$y)."\n";for(@e){warn"|"."_"x($x{$_}*$y)."| $_\n"}

The specification didn't state anywhere that this had to go to STDOUT, so I used perl's warn() instead of print - four characters saved there. Used map instead of foreach, but I feel like there could still be some more savings in the split(join()). Still, got it down to 203 - might sleep on it. At least Perl's now under the "shell, grep, tr, grep, sort, uniq, sort, head, perl" char count for now ;)

PS: Reddit says "Hi" ;)

Update: Removed join() in favour of assignment and implicit scalar conversion join. Down to 202. Also please note I have taken advantage of the optional "ignore 1-letter words" rule to shave 2 characters off, so bear in mind the frequency count will reflect this.

Update 2: Swapped out assignment and implicit join for killing $/ to get the file in one gulp using <> in the first place. Same size, but nastier. Swapped out if(!$y){} for $y||{}&&, saved 1 more char => 201.

Update 3: Took control of lowercasing early (lc<>) by moving lc out of the map block - Swapped out both regexes to no longer use /i option, as no longer needed. Swapped explicit conditional x?y:z construct for traditional perlgolf || implicit conditional construct - /^...$/i?1:$x{$}++ for /^...$/||$x{$}++ Saved three characters! => 198, broke the 200 barrier. Might sleep soon... perhaps.

Update 4: Sleep deprivation has made me insane. Well. More insane. Figuring that this only has to parse normal happy text files, I made it give up if it hits a null. Saved two characters. Replaced "length" with the 1-char shorter (and much more golfish) y///c - you hear me, GolfScript?? I'm coming for you!!! sob

Update 5: Sleep dep made me forget about the 22row limit and subsequent-line limiting. Back up to 208 with those handled. Not too bad, 13 characters to handle it isn't the end of the world. Played around with perl's regex inline eval, but having trouble getting it to both work and save chars... lol. Updated the example to match current output.

Update 6: Removed unneeded braces protecting (...)for, since the syntactic candy ++ allows shoving it up against the for happily. Thanks to input from Chas. Owens (reminding my tired brain), got the character class i[tns] solution in there. Back down to 203.

Update 7: Added second piece of work, full implementation of specs (including the full bar-squishing behaviour for secondary long-words, instead of truncation which most people are doing, based on the original spec without the pathological example case)

Examples:

 _________________________________________________________________________
|_________________________________________________________________________| she
|_______________________________________________________________| you
|____________________________________________________________| said
|_____________________________________________________| alice
|_______________________________________________| was
|___________________________________________| that
|____________________________________| as
|________________________________| her
|_____________________________| with
|_____________________________| at
|__________________________| on
|__________________________| all
|_______________________| this
|_______________________| for
|_______________________| had
|_______________________| but
|______________________| be
|_____________________| not
|____________________| they
|____________________| so
|___________________| very
|__________________| what

Alternative implementation in pathological case example:

 _______________________________________________________________
|_______________________________________________________________| she
|_______________________________________________________| superlongstringstring
|____________________________________________________| said
|______________________________________________| alice
|________________________________________| was
|_____________________________________| that
|_______________________________| as
|____________________________| her
|_________________________| with
|_________________________| at
|_______________________| on
|______________________| all
|____________________| this
|____________________| for
|____________________| had
|____________________| but
|___________________| be
|__________________| not
|_________________| they
|_________________| so
|________________| very
|________________| what

votes

F#, 452 chars

Strightforward: get a sequence a of word-count pairs, find the best word-count-per-column multiplier k, then print results.

let a=
 stdin.ReadToEnd().Split(" .?!,\":;'\r\n".ToCharArray(),enum 1)
 |>Seq.map(fun s->s.ToLower())|>Seq.countBy id
 |>Seq.filter(fun(w,n)->not(set["the";"and";"of";"to";"a";"i";"it";"in";"or";"is"].Contains w))
 |>Seq.sortBy(fun(w,n)-> -n)|>Seq.take 22
let k=a|>Seq.map(fun(w,n)->float(78-w.Length)/float n)|>Seq.min
let u n=String.replicate(int(float(n)*k)-2)"_"
printfn" %s "(u(snd(Seq.nth 0 a)))
for(w,n)in a do printfn"|%s| %s "(u n)w

Example (I have different freq counts than you, unsure why):

% app.exe < Alice.txt

 _________________________________________________________________________
|_________________________________________________________________________| she
|_______________________________________________________________| you
|_____________________________________________________________| said
|_____________________________________________________| alice
|_______________________________________________| was
|___________________________________________| that
|___________________________________| as
|________________________________| her
|_____________________________| with
|_____________________________| at
|____________________________| t
|____________________________| s
|__________________________| on
|_________________________| all
|_______________________| this
|______________________| had
|______________________| for
|_____________________| but
|_____________________| be
|____________________| not
|___________________| they
|__________________| so

votes

Python 2.6, 347 chars

import re
W,x={},"a and i in is it of or the to".split()
[W.__setitem__(w,W.get(w,0)-1)for w in re.findall("[a-z]+",file("11.txt").read().lower())if w not in x]
W=sorted(W.items(),key=lambda p:p[1])[:22]
bm=(76.-len(W[0][0]))/W[0][1]
U=lambda n:"_"*int(n*bm)
print "".join(("%s\n|%s| %s "%((""if i else" "+U(n)),U(n),w))for i,(w,n)in enumerate(W))

Output:

 _________________________________________________________________________
|_________________________________________________________________________| she 
|_______________________________________________________________| you 
|____________________________________________________________| said 
|_____________________________________________________| alice 
|_______________________________________________| was 
|___________________________________________| that 
|____________________________________| as 
|________________________________| her 
|_____________________________| with 
|_____________________________| at 
|____________________________| s 
|____________________________| t 
|__________________________| on 
|__________________________| all 
|_______________________| this 
|_______________________| for 
|_______________________| had 
|_______________________| but 
|______________________| be 
|_____________________| not 
|____________________| they 
|____________________| so

votes

sh (+curl), partial* solution

This is incomplete, but for the hell of it, here's the word-frequency counting half of the problem in 192 bytes:

curl -s http://www.gutenberg.org/files/11/11.txt|sed -e 's@[^a-z]@\n@gi'|tr '[:upper:]' '[:lower:]'|egrep -v '(^[^a-z]*$|\b(the|and|of|to|a|i|it|in|or|is)\b)' |sort|uniq -c|sort -n|tail -n 22

votes

Gawk -- 336 (originally 507) characters

(after fixing the output formatting; fixing the contractions thing; tweaking; tweaking again; removing a wholly unnecessary sorting step; tweaking yet again; and again (oops this one broke the formatting); tweak some more; taking up Matt's challenge I desperately tweak so more; found another place to save a few, but gave two back to fix the bar length bug)

Heh heh! I am momentarily ahead of [Matt's JavaScript][1] solution^{counter challenge! ;)}and [AKX's python][2].

The problem seems to call out for a language that implements native associative arrays, so of course I've chosen one with a horribly deficient set of operators on them. In particular, you cannot control the order in which awk offers up the elements of a hash map, so I repeatedly scan the whole map to find the currently most numerous item, print it and delete it from the array.

It is all terribly inefficient, with all the golfifcations I've made it has gotten to be pretty awful, as well.

Minified:

{gsub("[^a-zA-Z]"," ");for(;NF;NF--)a[tolower($NF)]++}
END{split("the and of to a i it in or is",b," ");
for(w in b)delete a[b[w]];d=1;for(w in a){e=a[w]/(78-length(w));if(e>d)d=e}
for(i=22;i;--i){e=0;for(w in a)if(a[w]>e)e=a[x=w];l=a[x]/d-2;
t=sprintf(sprintf("%%%dc",l)," ");gsub(" ","_",t);if(i==22)print" "t;
print"|"t"| "x;delete a[x]}}

line breaks for clarity only: they are not necessary and should not be counted.

Output:

$ gawk -f wordfreq.awk.min < 11.txt 
 _________________________________________________________________________
|_________________________________________________________________________| she
|_______________________________________________________________| you
|____________________________________________________________| said
|____________________________________________________| alice
|______________________________________________| was
|__________________________________________| that
|___________________________________| as
|_______________________________| her
|____________________________| with
|____________________________| at
|___________________________| s
|___________________________| t
|_________________________| on
|_________________________| all
|______________________| this
|______________________| for
|______________________| had
|_____________________| but
|____________________| be
|____________________| not
|___________________| they
|__________________| so
$ sed 's/you/superlongstring/gI' 11.txt | gawk -f wordfreq.awk.min
 ______________________________________________________________________
|______________________________________________________________________| she
|_____________________________________________________________| superlongstring
|__________________________________________________________| said
|__________________________________________________| alice
|____________________________________________| was
|_________________________________________| that
|_________________________________| as
|______________________________| her
|___________________________| with
|___________________________| at
|__________________________| s
|__________________________| t
|________________________| on
|________________________| all
|_____________________| this
|_____________________| for
|_____________________| had
|____________________| but
|___________________| be
|___________________| not
|__________________| they
|_________________| so

Readable; 633 characters (originally 949):

{
    gsub("[^a-zA-Z]"," ");
    for(;NF;NF--)
    a[tolower($NF)]++
}
END{
    # remove "short" words
    split("the and of to a i it in or is",b," ");
    for (w in b) 
    delete a[b[w]];
    # Find the bar ratio
    d=1;
    for (w in a) {
    e=a[w]/(78-length(w));
    if (e>d)
        d=e
    }
    # Print the entries highest count first
    for (i=22; i; --i){               
    # find the highest count
    e=0;
    for (w in a) 
        if (a[w]>e)
        e=a[x=w];
        # Print the bar
    l=a[x]/d-2;
    # make a string of "_" the right length
    t=sprintf(sprintf("%%%dc",l)," ");
    gsub(" ","_",t);
    if (i==22) print" "t;
    print"|"t"| "x;
    delete a[x]
    }
}

votes

Common LISP, 670 characters

I'm a LISP newbie, and this is an attempt using an hash table for counting (so probably not the most compact method).

(flet((r()(let((x(read-char t nil)))(and x(char-downcase x)))))(do((c(
make-hash-table :test 'equal))(w NIL)(x(r)(r))y)((not x)(maphash(lambda
(k v)(if(not(find k '("""the""and""of""to""a""i""it""in""or""is"):test
'equal))(push(cons k v)y)))c)(setf y(sort y #'> :key #'cdr))(setf y
(subseq y 0(min(length y)22)))(let((f(apply #'min(mapcar(lambda(x)(/(-
76.0(length(car x)))(cdr x)))y))))(flet((o(n)(dotimes(i(floor(* n f)))
(write-char #\_))))(write-char #\Space)(o(cdar y))(write-char #\Newline)
(dolist(x y)(write-char #\|)(o(cdr x))(format t "| ~a~%"(car x))))))
(cond((char<= #\a x #\z)(push x w))(t(incf(gethash(concatenate 'string(
reverse w))c 0))(setf w nil)))))

can be run on for example with cat alice.txt | clisp -C golf.lisp.

In readable form is

(flet ((r () (let ((x (read-char t nil)))
               (and x (char-downcase x)))))
  (do ((c (make-hash-table :test 'equal))  ; the word count map
       w y                                 ; current word and final word list
       (x (r) (r)))  ; iteration over all chars
       ((not x)

        ; make a list with (word . count) pairs removing stopwords
        (maphash (lambda (k v)
                   (if (not (find k '("" "the" "and" "of" "to"
                                      "a" "i" "it" "in" "or" "is")
                                  :test 'equal))
                       (push (cons k v) y)))
                 c)

        ; sort and truncate the list
        (setf y (sort y #'> :key #'cdr))
        (setf y (subseq y 0 (min (length y) 22)))

        ; find the scaling factor
        (let ((f (apply #'min
                        (mapcar (lambda (x) (/ (- 76.0 (length (car x)))
                                               (cdr x)))
                                y))))
          ; output
          (flet ((outx (n) (dotimes (i (floor (* n f))) (write-char #\_))))
             (write-char #\Space)
             (outx (cdar y))
             (write-char #\Newline)
             (dolist (x y)
               (write-char #\|)
               (outx (cdr x))
               (format t "| ~a~%" (car x))))))

       ; add alphabetic to current word, and bump word counter
       ; on non-alphabetic
       (cond
        ((char<= #\a x #\z)
         (push x w))
        (t
         (incf (gethash (concatenate 'string (reverse w)) c 0))
         (setf w nil)))))

votes

C (828)

It looks alot like obfuscated code, and uses glib for string, list and hash. Char count with wc -m says 828 . It does not consider single-char words. To calculate the max length of the bar, it consider the longest possible word among all, not only the first 22. Is this a deviation from the spec?

It does not handle failures and it does not release used memory.

#include <glib.h>
#define S(X)g_string_##X
#define H(X)g_hash_table_##X
GHashTable*h;int m,w=0,z=0;y(const void*a,const void*b){int*A,*B;A=H(lookup)(h,a);B=H(lookup)(h,b);return*B-*A;}void p(void*d,void*u){int *v=H(lookup)(h,d);if(w<22){g_printf("|");*v=*v*(77-z)/m;while(--*v>=0)g_printf("=");g_printf("| %s\n",d);w++;}}main(c){int*v;GList*l;GString*s=S(new)(NULL);h=H(new)(g_str_hash,g_str_equal);char*n[]={"the","and","of","to","it","in","or","is"};while((c=getchar())!=-1){if(isalpha(c))S(append_c)(s,tolower(c));else{if(s->len>1){for(c=0;c<8;c++)if(!strcmp(s->str,n[c]))goto x;if((v=H(lookup)(h,s->str))!=NULL)++*v;else{z=MAX(z,s->len);v=g_malloc(sizeof(int));*v=1;H(insert)(h,g_strdup(s->str),v);}}x:S(truncate)(s,0);}}l=g_list_sort(H(get_keys)(h),y);m=*(int*)H(lookup)(h,g_list_first(l)->data);g_list_foreach(l,p,NULL);}

votes

Perl, 185 char

~~200 (slightly broken)~~ ~~199~~ ~~197~~ ~~195~~ ~~193~~ ~~187~~ 185 characters. Last two newlines are significant. Complies with the spec.

map$X{+lc}+=!/^(.|the|and|to|i[nst]|o[rf])$/i,/[a-z]+/gfor<>;
$n=$n>($:=$X{$_}/(76-y+++c))?$n:$:for@w=(sort{$X{$b}-$X{$a}}%X)[0..21];
die map{$U='_'x($X{$_}/$n);" $U
"x!$z++,"|$U| $_
"}@w

First line loads counts of valid words into %X.

The second line computes minimum scaling factor so that all output lines will be <= 80 characters.

The third line (contains two newline characters) produces the output.

votes

Java - 886 865 756 744 742 744 752 742 714 680 chars

Updates before first 742: improved regex, removed superfluous parameterized types, removed superfluous whitespace.
Update 742 > 744 chars: fixed the fixed-length hack. It's only dependent on the 1st word, not other words (yet). Found several places to shorten the code (\\s in regex replaced by and ArrayList replaced by Vector). I'm now looking for a short way to remove the Commons IO dependency and reading from stdin.
Update 744 > 752 chars: I removed the commons dependency. It now reads from stdin. Paste the text in stdin and hit Ctrl+Z to get result.
Update 752 > 742 chars: I removed public and a space, made classname 1 char instead of 2 and it's now ignoring one-letter words.
Update 742 > 714 chars: Updated as per comments of Carl: removed redundant assignment (742 > 730), replaced m.containsKey(k) by m.get(k)!=null (730 > 728), introduced substringing of line (728 > 714).
Update 714 > 680 chars: Updated as per comments of Rotsor: improved bar size calculation to remove unnecessary casting and improved split() to remove unnecessary replaceAll().

import java.util.*;class F{public static void main(String[]a)throws Exception{StringBuffer b=new StringBuffer();for(int c;(c=System.in.read())>0;b.append((char)c));final Map<String,Integer>m=new HashMap();for(String w:b.toString().toLowerCase().split("(\\b(.|the|and|of|to|i[tns]|or)\\b|\\W)+"))m.put(w,m.get(w)!=null?m.get(w)+1:1);List<String>l=new Vector(m.keySet());Collections.sort(l,new Comparator(){public int compare(Object l,Object r){return m.get(r)-m.get(l);}});int c=76-l.get(0).length();String s=new String(new char[c]).replace('\0','_');System.out.println(" "+s);for(String w:l.subList(0,22))System.out.println("|"+s.substring(0,m.get(w)*c/m.get(l.get(0)))+"| "+w);}}

Scala 2.8, 311 314 320 330 332 336 341 375 characters

including long word adjustment. Ideas borrowed from the other solutions.

Now as a script (a.scala):

val t="\\w+\\b(?<!\\bthe|and|of|to|a|i[tns]?|or)".r.findAllIn(io.Source.fromFile(argv(0)).mkString.toLowerCase).toSeq.groupBy(w=>w).mapValues(_.size).toSeq.sortBy(-_._2)take 22
def b(p:Int)="_"*(p*(for((w,c)<-t)yield(76.0-w.size)/c).min).toInt
println(" "+b(t(0)._2))
for(p<-t)printf("|%s| %s \n",b(p._2),p._1)

Run with

scala -howtorun:script a.scala alice.txt

BTW, the edit from 314 to 311 characters actually removes only 1 character. Someone got the counting wrong before (Windows CRs?).

votes

Clojure 282 strict

(let[[[_ m]:as s](->>(slurp *in*).toLowerCase(re-seq #"\w+\b(?<!\bthe|and|of|to|a|i[tns]?|or)")frequencies(sort-by val >)(take 22))[b](sort(map #(/(- 76(count(key %)))(val %))s))p #(do(print %1)(dotimes[_(* b %2)](print \_))(apply println %&))](p " " m)(doseq[[k v]s](p \| v \| k)))

Somewhat more legibly:

(let[[[_ m]:as s](->> (slurp *in*)
                   .toLowerCase
                   (re-seq #"\w+\b(?<!\bthe|and|of|to|a|i[tns]?|or)")
                   frequencies
                   (sort-by val >)
                   (take 22))
     [b] (sort (map #(/ (- 76 (count (key %)))(val %)) s))
     p #(do
          (print %1)
          (dotimes[_(* b %2)] (print \_))
          (apply println %&))]
  (p " " m)
  (doseq[[k v] s] (p \| v \| k)))

votes

Scala, 368 chars

First, a legible version in 592 characters:

object Alice {
  def main(args:Array[String]) {
    val s = io.Source.fromFile(args(0))
    val words = s.getLines.flatMap("(?i)\\w+\\b(?<!\\bthe|and|of|to|a|i|it|in|or|is)".r.findAllIn(_)).map(_.toLowerCase)
    val freqs = words.foldLeft(Map[String, Int]())((countmap, word)  => countmap + (word -> (countmap.getOrElse(word, 0)+1)))
    val sortedFreqs = freqs.toList.sort((a, b)  => a._2 > b._2)
    val top22 = sortedFreqs.take(22)
    val highestWord = top22.head._1
    val highestCount = top22.head._2
    val widest = 76 - highestWord.length
    println(" " + "_" * widest)
    top22.foreach(t => {
      val width = Math.round((t._2 * 1.0 / highestCount) * widest).toInt
      println("|" + "_" * width + "| " + t._1)
    })
  }
}

The console output looks like this:

$ scalac alice.scala 
$ scala Alice aliceinwonderland.txt
 _________________________________________________________________________
|_________________________________________________________________________| she
|_______________________________________________________________| you
|_____________________________________________________________| said
|_____________________________________________________| alice
|_______________________________________________| was
|____________________________________________| that
|____________________________________| as
|_________________________________| her
|______________________________| at
|______________________________| with
|_____________________________| s
|_____________________________| t
|___________________________| on
|__________________________| all
|_______________________| had
|_______________________| but
|______________________| be
|______________________| not
|____________________| they
|____________________| so
|___________________| very
|___________________| what

We can do some aggressive minifying and get it down to 415 characters:

object A{def main(args:Array[String]){val l=io.Source.fromFile(args(0)).getLines.flatMap("(?i)\\w+\\b(?<!\\bthe|and|of|to|a|i|it|in|or|is)".r.findAllIn(_)).map(_.toLowerCase).foldLeft(Map[String, Int]())((c,w)=>c+(w->(c.getOrElse(w,0)+1))).toList.sort((a,b)=>a._2>b._2).take(22);println(" "+"_"*(76-l.head._1.length));l.foreach(t=>println("|"+"_"*Math.round((t._2*1.0/l.head._2)*(76-l.head._1.length)).toInt+"| "+t._1))}}

The console session looks like this:

$ scalac a.scala 
$ scala A aliceinwonderland.txt
 _________________________________________________________________________
|_________________________________________________________________________| she
|_______________________________________________________________| you
|_____________________________________________________________| said
|_____________________________________________________| alice
|_______________________________________________| was
|____________________________________________| that
|____________________________________| as
|_________________________________| her
|______________________________| at
|______________________________| with
|_____________________________| s
|_____________________________| t
|___________________________| on
|__________________________| all
|_______________________| had
|_______________________| but
|______________________| be
|______________________| not
|____________________| they
|____________________| so
|___________________| very
|___________________| what

I'm sure a Scala expert could do even better.

Update: In the comments Thomas gave an even shorter version, at 368 characters:

object A{def main(a:Array[String]){val t=(Map[String, Int]()/:(for(x<-io.Source.fromFile(a(0)).getLines;y<-"(?i)\\w+\\b(?<!\\bthe|and|of|to|a|i|it|in|or|is)".r findAllIn x) yield y.toLowerCase).toList)((c,x)=>c+(x->(c.getOrElse(x,0)+1))).toList.sortBy(_._2).reverse.take(22);val w=76-t.head._1.length;print(" "+"_"*w);t map (s=>"\n|"+"_"*(s._2*w/t.head._2)+"| "+s._1) foreach print}}

Legibly, at 375 characters:

object Alice {
  def main(a:Array[String]) {
    val t = (Map[String, Int]() /: (
      for (
        x <- io.Source.fromFile(a(0)).getLines
        y <- "(?i)\\w+\\b(?<!\\bthe|and|of|to|a|i|it|in|or|is)".r.findAllIn(x)
      ) yield y.toLowerCase
    ).toList)((c, x) => c + (x -> (c.getOrElse(x, 0) + 1))).toList.sortBy(_._2).reverse.take(22)
    val w = 76 - t.head._1.length
    print (" "+"_"*w)
    t.map(s => "\n|" + "_" * (s._2 * w / t.head._2) + "| " + s._1).foreach(print)
  }
}

votes

Java - 896 chars

931 chars

1233 chars made unreadable

1977 chars "uncompressed"

Update: I have aggressively reduced the character count. Omits single-letter words per updated spec.

I envy C# and LINQ so much.

import java.util.*;import java.io.*;import static java.util.regex.Pattern.*;class g{public static void main(String[] a)throws Exception{PrintStream o=System.out;Map<String,Integer> w=new HashMap();Scanner s=new Scanner(new File(a[0])).useDelimiter(compile("[^a-z]+|\\b(the|and|of|to|.|it|in|or|is)\\b",2));while(s.hasNext()){String z=s.next().trim().toLowerCase();if(z.equals(""))continue;w.put(z,(w.get(z)==null?0:w.get(z))+1);}List<Integer> v=new Vector(w.values());Collections.sort(v);List<String> q=new Vector();int i,m;i=m=v.size()-1;while(q.size()<22){for(String t:w.keySet())if(!q.contains(t)&&w.get(t).equals(v.get(i)))q.add(t);i--;}int r=80-q.get(0).length()-4;String l=String.format("%1$0"+r+"d",0).replace("0","_");o.println(" "+l);o.println("|"+l+"| "+q.get(0)+" ");for(i=m-1;i>m-22;i--){o.println("|"+l.substring(0,(int)Math.round(r*(v.get(i)*1.0)/v.get(m)))+"| "+q.get(m-i)+" ");}}}

"Readable":

import java.util.*;
import java.io.*;
import static java.util.regex.Pattern.*;
class g
{
   public static void main(String[] a)throws Exception
      {
      PrintStream o = System.out;
      Map<String,Integer> w = new HashMap();
      Scanner s = new Scanner(new File(a[0]))
         .useDelimiter(compile("[^a-z]+|\\b(the|and|of|to|.|it|in|or|is)\\b",2));
      while(s.hasNext())
      {
         String z = s.next().trim().toLowerCase();
         if(z.equals(""))
            continue;
         w.put(z,(w.get(z) == null?0:w.get(z))+1);
      }
      List<Integer> v = new Vector(w.values());
      Collections.sort(v);
      List<String> q = new Vector();
      int i,m;
      i = m = v.size()-1;
      while(q.size()<22)
      {
         for(String t:w.keySet())
            if(!q.contains(t)&&w.get(t).equals(v.get(i)))
               q.add(t);
         i--;
      }
      int r = 80-q.get(0).length()-4;
      String l = String.format("%1$0"+r+"d",0).replace("0","_");
      o.println(" "+l);
      o.println("|"+l+"| "+q.get(0)+" ");
      for(i = m-1; i > m-22; i--)
      {
         o.println("|"+l.substring(0,(int)Math.round(r*(v.get(i)*1.0)/v.get(m)))+"| "+q.get(m-i)+" ");
      }
   }
}

Output of Alice:

 _________________________________________________________________________
|_________________________________________________________________________| she
|_______________________________________________________________| you
|_____________________________________________________________| said
|_____________________________________________________| alice
|_______________________________________________| was
|____________________________________________| that
|____________________________________| as
|_________________________________| her
|______________________________| with
|______________________________| at
|___________________________| on
|__________________________| all
|________________________| this
|________________________| for
|_______________________| had
|_______________________| but
|______________________| be
|______________________| not
|____________________| they
|____________________| so
|___________________| very
|___________________| what

Output of Don Quixote (also from Gutenberg):

 ________________________________________________________________________
|________________________________________________________________________| that
|________________________________________________________| he
|______________________________________________| for
|__________________________________________| his
|________________________________________| as
|__________________________________| with
|_________________________________| not
|_________________________________| was
|________________________________| him
|______________________________| be
|___________________________| don
|_________________________| my
|_________________________| this
|_________________________| all
|_________________________| they
|________________________| said
|_______________________| have
|_______________________| me
|______________________| on
|______________________| so
|_____________________| you
|_____________________| quixote

Build an ASCII chart of the most commonly used words in a given text [closed]

The challenge:

30 Answers

LabVIEW 51 nodes, 5 structures, 10 diagrams

Ruby 1.9, 185 chars

GolfScript, 177 175 173 167 164 163 144 131 130 chars

206

shell, grep, tr, grep, sort, uniq, sort, head, perl

Transact SQL set based solution (SQL Server 2005) 1063 892 873 853 827 820 783 683 647 644 630 characters

Ruby 207 213 211 210 207 203 201 200 chars

Mathematica (297 284 248 244 242 199 chars) Pure Functional

and Zipf's Law Testing

Zipf's Law Testing

Edit 6 > (242 Chars)

Edit 7 → 199 characters

C# - 510 451 436 446 434 426 422 chars (minified)

Perl, 237 229 209 chars

Windows PowerShell, 199 chars

Ruby, 215, 216, 218, 221, 224, 236, 237 chars

Python 2.x, latitudinarian approach = 227 183 chars

Rant

Python 2.x, punctilious approach = 277 243 chars

Haskell - 366 351 344 337 333 characters

JavaScript 1.8 (SpiderMonkey) - 354

JavaScript (Rhino) - 405 395 387 377 368 343 304 chars

PHP CLI version (450 chars)

Python 3.1 - 245 229 charaters

perl, 205 191 189 characters/ 205 characters (fully implemented)

Perl: 203 202 201 198 195 208 203 / 231 chars

F#, 452 chars

Python 2.6, 347 chars

*sh (+curl), partial solution

Gawk -- 336 (originally 507) characters

Common LISP, 670 characters

C (828)

Perl, 185 char

Java - 886 865 756 744 742 744 752 742 714 680 chars

Scala 2.8, 311 314 320 330 332 336 341 375 characters

Clojure 282 strict

Scala, 368 chars

Java - 896 chars

931 chars

1233 chars made unreadable

1977 chars "uncompressed"

sh (+curl), partial* solution