I use haskell for line-based data processing, i.e. tasks where you can apply sed, awk and similar tools. As a trivial example, let's prepend 000 to every line from standard input.
I have three alternative ways to do the task:
- Lazy IO with lazy
ByteStrings - Line-based conduit.
- Chunk-based conduit with pure strict
ByteStringprocessing inside.
example.hs:
{-# LANGUAGE NoImplicitPrelude #-}
{-# LANGUAGE OverloadedStrings #-}
{-# LANGUAGE FlexibleContexts #-}
import ClassyPrelude.Conduit
import qualified Data.ByteString.Char8 as B8
import qualified Data.ByteString.Lazy.Char8 as BL8
import qualified Data.Conduit.Binary as CB
main = do
[arg] <- getArgs
case arg of
"lazy" -> BL8.getContents >>= BL8.putStr . BL8.unlines . map ("000" ++) . BL8.lines
"lines" -> runConduitRes $ stdinC .| CB.lines .|
mapC ("000" ++) .| mapC (`snoc` 10) .| stdoutC
"chunks" -> runConduitRes $ stdinC .| lineChunksC .|
mapC (B8.unlines . (map ("000" ++)) . B8.lines) .| stdoutC
lineChunksC :: Monad m => Conduit ByteString m ByteString
lineChunksC = await >>= maybe (return ()) go
where
go acc = if
| Just (_, 10) <- unsnoc acc -> yield acc >> lineChunksC
| otherwise -> await >>= maybe (yield acc) (go' . breakAfterEOL)
where
go' (this, next) = let acc' = acc ++ this in if null next then go acc' else yield acc' >> go next
breakAfterEOL :: ByteString -> (ByteString, ByteString)
breakAfterEOL = uncurry (\x -> maybe (x, "") (first (snoc x)) . uncons) . break (== 10)
$ stack ghc --package={classy-prelude-conduit,conduit-extra} -- -O2 example.hs -o example
$ for cmd in lazy lines chunks; do echo $cmd; time -p seq 10000000 | ./example $cmd > /dev/null; echo; done
lazy
real 2.99
user 3.06
sys 0.07
lines
real 3.30
user 3.36
sys 0.06
chunks
real 1.83
user 1.95
sys 0.06
(The results are consistent across multiple runs, and also hold for lines with several numbers).
So chunks is 1.6x faster than lines which is a bit faster than lazy. This means that conduits can be faster than plain bytestrings, but the overhead of conduit pipes is too heavy when you split chunks into short lines.
What I don't like about chunks approach is that it mixes both conduit and pure worlds, and it makes it harder to use it for more complex tasks.
The question is, did I miss a simple and elegant solution which would allow me to write efficient code in same fashion as with lines approach?
EDIT1: Per @Michael's suggestion I joined two mapC into one mapC (("000" ++). (snoc10)) in lines solution, to make number of pipes (.|) same between lines and chunks. This made it perform a bit better (down from 3.3s to 2.8s), but still significantly slower than chunks.
Also I tried older Conduit.Binary.lines which Michael suggested in comments, and it also improves performance a bit, by ~0.1s.
EDIT2: Fixed lineChunksC so it works with very small chunks, e.g.
> runConduitPure $ yield ("\nr\n\n"::ByteString) .| concatC .| mapC singleton .| lineChunksC .| sinkList
["\n","r\n","\n"]
mapC (`snoc` 10)? - chepner'\n', so it appends newline character back to each line. - modularstdinC' = (stdinC :: Source (ResourceT IO) B.ByteString) .| concatC .| mapC B.singletonThis doesn't affect the semantics ofstdinCbut it breaks this function, but not the conduit lines function. - Michaelchunksfunction allocates as many intermediate bytestrings as there are bytes in the line. It is rational where you expect one or two chunks per line. This is fairly typical of course. - MichaelByteString.lineswhere possible - but as expected, if I feed it a file that is 10 million 't's with no linebreaks, it is much slower sprunge.us/APeK If I give the file in single byte chunks char by char, it doesn't complete after 5 minutes, since it is allocating 10M separate bytestrings, while the other is done in 5 sec. It is I think quadratic in the number of chunks per line. Here's my source sprunge.us/gVQG - Michael