I need to take 20 results from a lazy sequence of millions of hash-maps but for the 20 to be based on sorting on various values within the hashmaps.
For example:
(def population [{:id 85187153851 :name "anna" :created #inst "2012-10-23T20:36:25.626-00:00" :rank 77336}
{:id 12595145186 :name "bob" :created #inst "2011-02-03T20:36:25.626-00:00" :rank 983666}
{:id 98751563911 :name "cartmen" :created #inst "2007-01-13T20:36:25.626-00:00" :rank 112311}
...
{:id 91514417715 :name "zaphod" :created #inst "2015-02-03T20:36:25.626-00:00" :rank 9866}]
In normal circumstances a simple sort-by
would get the job done:
(sort-by :id population)
(sort-by :name population)
(sort-by :created population)
(sort-by :rank population)
But I need to do this across millions of records as fast as possible and want to do it lazily rather than having to realize the entire data set.
I looked around a lot and found a number of implementations of algorithms that work really well for sorting a sequence of values (mostly numeric) but none for a lazy sequence of hash-maps in the way I need.
Speed & efficiency being of prime importance, the best I have found has been the quicksort example from the Joy Of Clojure book (Chapter 6.4) which does just enough work to return the required result.
(ns joy.q)
(defn sort-parts
"Lazy, tail-recursive, incremental quicksort. Works against
and creates partitions based on the pivot, defined as 'work'."
[work]
(lazy-seq
(loop [[part & parts] work]
(if-let [[pivot & xs] (seq part)]
(let [smaller? #(< % pivot)]
(recur (list*
(filter smaller? xs)
pivot
(remove smaller? xs)
parts)))
(when-let [[x & parts] parts]
(cons x (sort-parts parts)))))))
(defn qsort [xs]
(sort-parts (list xs)))
Works really well...
(time (take 10 (qsort (shuffle (range 10000000)))))
"Elapsed time: 551.714003 msecs"
(0 1 2 3 4 5 6 7 8 9)
Great! But...
However much I try I can't seem to work out how to apply this to a sequence of hashmaps.
I need something like:
(take 20 (qsort-by :created population))