[Note: Title and text were heavily edited to make it more clear that I'm not particularly after strings, but after general sequences, and lazy processing of same]
Using character sequences / strings as an example, say I want to turn a string like
"\t a\r s\td \t \r \n f \r\n"
into
" a s d f "
In more general terms, I want to turn all contiguous whitespace (or any other arbitray set of items) in a sequence into a single item, and that lazily.
I've come up with the following partition-by/mapcat combo, but wonder if there are easier or otherwise better ways (readability, performance, anything) to accomplish the same thing.
(defn is-wsp?
[c]
(if (#{\space \tab \newline \return} c) true))
(defn collapse-wsp
[coll]
(mapcat
(fn [[first-elem :as s]]
(if (is-wsp? first-elem) [\space] s))
(partition-by is-wsp? coll)))
In action:
=> (apply str (collapse-wsp "\t a\r s\td \t \r \n f \r\n"))
" a s d f "
Update: I used strings / character sequences / wsp, as an example, but what I actually want is a generic function on sequences of any type that collapses arbitrary numbers of contiguous items, which are part of a predefined set of items, by some single predefined item. I'm particularly interested to know if there are better alternatives to partition-by/mapcat, not so much if this can be optimized for the 'string' special case.
Update 2:
Here's a fully lazy version - the one above isn't fully lazy, I fear, apart from that it is doing redundant is-wsp? checks. I generalized the parameter names etc. so it doesn't just look like something you could easily replace by a String.whatever() call - it's about arbitrary sequences.
(defn lazy-collapse
([coll is-collapsable-item? collapsed-item-representation] (lazy-collapse coll is-collapsable-item? collapsed-item-representation false))
([coll is-collapsable-item? collapsed-item-representation in-collapsable-segment?]
(let [step (fn [coll in-collapsable-segment?]
(when-let [item (first coll)]
(if (is-collapsable-item? item)
(if in-collapsable-segment?
(recur (rest coll) true)
(cons collapsed-item-representation (lazy-collapse (rest coll) is-collapsable-item? collapsed-item-representation true)))
(cons item (lazy-collapse (rest coll) is-collapsable-item? collapsed-item-representation false)))))]
(lazy-seq (step coll in-collapsable-segment?)))))
This is fast, fully lazy, but I'd like to be able to express that more concisely, as I am quite lazy myself.
Benchmarks of the lazy collapsers so far: Whether the code is readable or not is easy to judge by looking at the code, but in order to see how they compare in terms of performance, here are my benchmarks. I first check if the function does what it is supposed to do, and then I spit out how long it takes to
- create the lazy seq 1M times
- create the lazy seq and take the first item 1M times
- create the lazy seq and take the second item 1M times
- create the lazy seq and take the last item (i.e. fully realize the lazy seq) 1M times
Tests 1 through 3 are meant to gauge the laziness at least a little bit. I ran the test a couple of times, and there was no significant variation in the execution times.
user=> (map
(fn [collapse]
(println (class collapse) (str "|" (apply str (collapse test-str is-wsp? \space)) "|"))
(time (dotimes [_ 1000000] (collapse test-str is-wsp? \space)))
(time (dotimes [_ 1000000] (first (collapse test-str is-wsp? \space))))
(time (dotimes [_ 1000000] (second (collapse test-str is-wsp? \space))))
(time (dotimes [_ 1000000] (last (collapse test-str is-wsp? \space)))))
[collapse-overthink collapse-smith collapse-normand lazy-collapse])
user$collapse_overthink | a s d f |
"Elapsed time: 153.490591 msecs"
"Elapsed time: 3064.721629 msecs"
"Elapsed time: 4337.932487 msecs"
"Elapsed time: 24797.222682 msecs"
user$collapse_smith | a s d f |
"Elapsed time: 141.474904 msecs"
"Elapsed time: 812.998848 msecs"
"Elapsed time: 2112.331739 msecs"
"Elapsed time: 10750.224816 msecs"
user$collapse_normand | a s d f |
"Elapsed time: 314.978309 msecs"
"Elapsed time: 1423.779761 msecs"
"Elapsed time: 1669.660257 msecs"
"Elapsed time: 8074.759077 msecs"
user$lazy_collapse | a s d f |
"Elapsed time: 169.906088 msecs"
"Elapsed time: 638.030401 msecs"
"Elapsed time: 1195.445016 msecs"
"Elapsed time: 6050.945856 msecs"
Bottom line so far: The nicest code is slowest, the ugliest code is fastest. I'm pretty sure it doesn't have to be this way...