Need help rewriting a Haskell program that groups words by anagrams in OCaml (or any ML)?

Question

I wrote a Haskell function that groups words by anagrams. I'm trying to learn OCaml, but I'm a little confused as to how use pattern matching in OCaml. Could someone help translate this to OCaml for me? Thank you!

This function takes a list of strings, and partitions it into a list of string lists, grouped by anagrams.

import Data.List

groupByAnagrams :: [String] -> [[String]]
groupByAnagrams []     = []
groupByAnagrams (x:xs) = let (listOfAnagrams, listOfNonAnagrams) = (partitionByAnagrams (sort x) xs)
                         in 
                         (x:listOfAnagrams):(groupByAnagrams listOfNonAnagrams)

This helper function takes a sorted string sortedStr, and a list of strings (the reason the string is sorted is so that I don't have to call sort on it every iteration). The string list is partitioned into two lists; one consisting of the strings that are anagrams to sortedStr, the other consisting of the strings that are not. The function returns the tuple that consists of these two lists.

partitionByAnagrams :: String -> [String] -> ([String], [String])
partitionByAnagrams sortedStr []     = ([], [])
partitionByAnagrams sortedStr (x:xs) 
         | (sortedStr == (sort x))   = let (listOfAnagrams, listOfNonAnagrams) = (partitionByAnagrams sortedStr xs)
                                       in
                                       (x:listOfAnagrams, listOfNonAnagrams)
         | otherwise                 = let (listOfAnagrams, listOfNonAnagrams) = (partitionByAnagrams sortedStr xs)
                                       in
                                       (listOfAnagrams, x:listOfNonAnagrams)

This is just a test case:

test1 = mapM_ print (groupByAnagrams ["opts", "alerting", "arrest", "bares", "drapes", "drawer", "emits", "least", "mate", "mates", "merit", "notes", "palest", "parses", "pores", "pots", "altering", "rarest", "baser", "parsed", "redraw", "items", "slate", "meat", "meats", "miter", "onset", "pastel", "passer", "poser", "spot", "integral", "raster", "bears", "rasped", "reward", "mites", "stale", "meta", "steam", "mitre", "steno", "petals", "spares", "prose", "stop", "relating", "raters", "braes", "spared", "warder", "smite", "steal", "tame", "tames", "remit", "stone", "plates", "sparse", "ropes", "tops", "triangle", "starer", "saber", "spread", "warred", "times", "tales", "team", "teams", "timer", "tones", "staple", "spears", "spore"])

**EDIT!!! This is a rewritten version of my function. Thanks to jrouquie for pointing out the inefficiency! **EDITED AGAIN ON 10/7 - used pattern matching on tuples for clarity, no need for all those fsts and snds.

groupByAnagrams2 :: [String] -> [[String]]
groupByAnagrams2 str = groupBySnd $ map (\s -> (s, (sort s))) str

groupBySnd :: [(String, String)] -> [[String]]
groupBySnd []           = []
groupBySnd ((s1,s2):xs) = let (listOfAnagrams, listOfNonAnagramPairs) = (partitionBySnd s2 xs)
                          in
                          (s1:listOfAnagrams):(groupBySnd listOfNonAnagramPairs)


partitionBySnd :: String -> [(String, String)] -> ([String], [(String, String)])
partitionBySnd sortedStr []                = ([], [])
partitionBySnd sortedStr ((s, sSorted):ss)
              | (sortedStr == sSorted)     = let (listOfAnagrams, listOfNonAnagramPairs) = (partitionBySnd sortedStr ss)
                                             in
                                             (s:listOfAnagrams, listOfNonAnagramPairs)
              | otherwise                  = let (listOfAnagrams, listOfNonAnagramPairs) = (partitionBySnd sortedStr ss)
                                             in
                                             (listOfAnagrams, (s, sSorted):listOfNonAnagramPairs)

Note that while sortedStr is indeed sorted only once, each element of the remainig list xs is sorted once for each call to partitionByAnagrams. You might want to sort (a copy of) the whole list once and for all. (This uses more memory.) — jrouquie
@jrouquie How will that decrease the number of sortings? Either way all strings in the list will be sorted, right? — dxuhuang
If you sort a sorted string it is at least a O(n) operation, because the algorithm still has to look at each element to see if the list is sorted. @jrouquie, how would that lead to more memory usage? — dflemstr
@jrouquie I think I know what you're saying - some of the remaining strings will be sorted more than once. — dxuhuang
On the example input ["foo", "bar", "qux", "egg"], your code calls partitionByAnagrams (sort x) xs 4 times, with xs taking resp. values ["bar", "qux", "egg"], ["qux", "egg"], ["egg"], []. Since partitionByAnagrams x xs sorts each element of xs once, "egg" is sorted 3 times. More precisely, this code calls sort Θ(n²) times, while it could call it only O(n) times. (edit : yes, you're right.) — jrouquie

Stefan Holdermans Stefan Holdermans · Accepted Answer · 2012-08-13T14:14:45

I have to say that I find your Haskell code a bit clumsy. That is, your original function could have been written much more concise; for example:

import Control.Arrow ((&&&))
import Data.Function (on)
import Data.List (groupBy, sortBy)

anagrams :: Ord a => [[a]] -> [[[a]]]
anagrams =
  map (map fst) .
  groupBy ((==) `on` snd) .
  sortBy (compare `on` snd) .
  map (id &&& sortBy compare)

That is:

map (id &&& sortBy compare) pairs each string in the list with a sorted list of its characters;
sortBy (on compare snd) sorts the list of pairs that you now have on their second components, i.e., the sorted list of characters;
groupBy (on (==) snd) groups all consecutive items in the sorted list that have identical lists of sorted characters;
finally, map (map fst) drops the lists of sorted characters and leaves you with just the original strings.

For example:

Prelude> :m + Control.Arrow Data.Function Data.List

Prelude Control.Arrow Data.Function Data.List> ["foo", "bar", "rab", "ofo"]
["foo","bar","rab","ofo"]

Prelude Control.Arrow Data.Function Data.List> map (id &&& sortBy compare) it
[("foo","foo"),("bar","abr"),("rab","abr"),("ofo","foo")]

Prelude Control.Arrow Data.Function Data.List> sortBy (compare `on` snd) it
[("bar","abr"),("rab","abr"),("foo","foo"),("ofo","foo")]

Prelude Control.Arrow Data.Function Data.List> groupBy ((==) `on` snd) it
[[("bar","abr"),("rab","abr")],[("foo","foo"),("ofo","foo")]]

Prelude Control.Arrow Data.Function Data.List> map (map fst) it
[["bar","rab"],["foo","ofo"]]

"Translating" to Caml will then leave you with something along the lines of

let chars xs =
  let n = String.length xs in
  let rec chars_aux i =
    if i = n then [] else String.get xs i :: chars_aux (i + 1)
  in
  List.sort compare (chars_aux 0)

let group eq xs =
  let rec group_aux = function
    | [] -> []
    | [x] -> [[x]]
    | x :: xs ->
        let ((y :: _) as ys) :: yss = group_aux xs in
        if eq x y then (x :: ys) :: yss else [x] :: ys :: yss
  in
  group_aux xs

let anagrams xs =
  let ys = List.map chars xs in
  let zs = List.sort (fun (_,y1) (_,y2) -> compare y1 y2) (List.combine xs ys) in
  let zs = group (fun (_,y1) (_,y2) -> y1 = y2) zs in
  List.map (List.map fst) zs

Here, the helper function chars takes a string to a sorted list of characters, while group should give you some insight in how to do pattern matching on lists in Caml.

Need help rewriting a Haskell program that groups words by anagrams in OCaml (or any ML)?

2 Answers