haskell reading a vector of pairs from bytestring very slow, how to make it faster?

Question

I am trying to read a large vector of custom data type from a binary file. I tried to use the example given here.

The trouble with the example code is, it uses lists and I want to use vectors. So I adapted that code as below, but it takes very long time (more than a minute, I gave up after that) to read even 1 MB file.

module Main where

import Data.Word
import qualified Data.ByteString.Lazy as BIN
import Data.Binary.Get
import qualified Data.Vector.Unboxed as Vec

main = do
  b <- BIN.readFile "dat.bin" -- about 1 MB size file
  let v = runGet getPairs (BIN.tail b) -- skip the first byte
  putStrLn $ show $ Vec.length v

getPair :: Get (Word8, Word8)
getPair = do
  price <- getWord8
  qty <- getWord8
  return (price, qty)

getPairs :: Get (Vec.Vector (Word8, Word8))
getPairs = do
 empty <- isEmpty
 if empty
   then return Vec.empty
   else do pair  <- getPair
           pairs <- getPairs
           return (Vec.cons pair pairs) -- is it slow because V.cons is O(n)?

When I tried to run it with ghc --make -O2 pairs.hs I got the error Stack space overflow: current size ...

How to efficiently read pairs of values from bytestring into vector?

Again, I wish to get complete working code not just only pointers to Haskell site or RWH nor a just list of function/module names.

you shouldn't demand a complete working solution as answer. SO is a learning and problem understanding site not a crowd sourced coding service. — Erik Kaplun
@ErikAllik I am demanding complete working solution in the same spirit of Smallest Complete Working Example. Also I do give minimum working example. Another reason is the rarity of working examples on the Haskell sites. The problems I ask solutions for are generally very small snippets, hardly sufficient to accusing me of crowd sourcing code. The SO is great and Haskell docuemntation is poor is why I need to make that demand. I apologize if it has hurt/offended or caused undue distress to someone. — mntk123
Yes, this is slow because V.cons is O(n). What did you expect? A million times a million is a lot! vector has perfectly good documentation. Note that the documentation for a Hackage module generally has a table of contents in the upper right. Check out the section on constructing vectors. — dfeuer
@dfeuer I expected it to be slow. But I didn't know any other method to construct the vector by incrementally adding elements to it as I go on reading values from the bytestring. As for why did I ask the question: I didn't understand any other method given on the hackage that can allow me to incrementally construct the vector as desired. — mntk123
Well, you basically can't construct a vector by incrementally adding to it! They're more like C or C++ arrays or FORTRAN vectors than like the flexible things some other environments call vectors. You can get such flexible things in Haskell if you like; the traditional one is Data.Sequence, but Ed Kmett's been working on some faster ones lately. — dfeuer

ErikR ErikR · Accepted Answer · 2015-09-19T13:10:49

Here are a couple of examples of creating Vectors from files. They are not the most efficient, but both run in just a couple of seconds in ghci.

module Main where

import qualified Data.ByteString.Lazy as BIN
import qualified Data.ByteString as BS
import qualified Data.Vector.Unboxed as Vec
import System.IO
import System.Posix

getFileSize :: String -> IO Int
getFileSize path = do
    stat <- getFileStatus path
    return (fromEnum $ fileSize stat)

readVector1 path = do
  size <- getFileSize path
  withBinaryFile path ReadMode $ \h -> do
    -- can also use: size <- hFileSize h
    let go _ = do bs <- BS.hGet h 2 
                  return (BS.index bs 0, BS.index bs 1)
    Vec.generateM (div size 2) go

pairs (a:b:rest) = (a,b) : pairs rest
pairs _          = []

readVector2 path = do
  contents <- BIN.readFile path
  -- unfoldr :: Unbox a => (b -> Maybe (a, b)) -> b -> Vector a
  let v = Vec.unfoldr go (pairs $ BIN.unpack contents)
        where go [] = Nothing
              go (p:ps) = Just (p, ps)
  return v

main = do
  v <- readVector1 "rand" -- large file
  print $ Vec.length v
  v <- readVector2 "rand"
  print $ Vec.length v

A third alternative:

readVector3 path = do
  contents <- BS.readFile path
  let size = BS.length contents
      v = Vec.generate (div (fromIntegral size) 2) go
            where go i = let a = BS.index contents (2*i)
                             b = BS.index contents (2*i+1)
                         in (a,b)
  return v

This one turns out to be the fastest of the three.

haskell reading a vector of pairs from bytestring very slow, how to make it faster?

2 Answers