3
votes

I am trying to read a large vector of custom data type from a binary file. I tried to use the example given here.

The trouble with the example code is, it uses lists and I want to use vectors. So I adapted that code as below, but it takes very long time (more than a minute, I gave up after that) to read even 1 MB file.

module Main where

import Data.Word
import qualified Data.ByteString.Lazy as BIN
import Data.Binary.Get
import qualified Data.Vector.Unboxed as Vec

main = do
  b <- BIN.readFile "dat.bin" -- about 1 MB size file
  let v = runGet getPairs (BIN.tail b) -- skip the first byte
  putStrLn $ show $ Vec.length v

getPair :: Get (Word8, Word8)
getPair = do
  price <- getWord8
  qty <- getWord8
  return (price, qty)

getPairs :: Get (Vec.Vector (Word8, Word8))
getPairs = do
 empty <- isEmpty
 if empty
   then return Vec.empty
   else do pair  <- getPair
           pairs <- getPairs
           return (Vec.cons pair pairs) -- is it slow because V.cons is O(n)?

When I tried to run it with ghc --make -O2 pairs.hs I got the error Stack space overflow: current size ...

How to efficiently read pairs of values from bytestring into vector?

Again, I wish to get complete working code not just only pointers to Haskell site or RWH nor a just list of function/module names.

2
you shouldn't demand a complete working solution as answer. SO is a learning and problem understanding site not a crowd sourced coding service.Erik Kaplun
@ErikAllik I am demanding complete working solution in the same spirit of Smallest Complete Working Example. Also I do give minimum working example. Another reason is the rarity of working examples on the Haskell sites. The problems I ask solutions for are generally very small snippets, hardly sufficient to accusing me of crowd sourcing code. The SO is great and Haskell docuemntation is poor is why I need to make that demand. I apologize if it has hurt/offended or caused undue distress to someone.mntk123
Yes, this is slow because V.cons is O(n). What did you expect? A million times a million is a lot! vector has perfectly good documentation. Note that the documentation for a Hackage module generally has a table of contents in the upper right. Check out the section on constructing vectors.dfeuer
@dfeuer I expected it to be slow. But I didn't know any other method to construct the vector by incrementally adding elements to it as I go on reading values from the bytestring. As for why did I ask the question: I didn't understand any other method given on the hackage that can allow me to incrementally construct the vector as desired.mntk123
Well, you basically can't construct a vector by incrementally adding to it! They're more like C or C++ arrays or FORTRAN vectors than like the flexible things some other environments call vectors. You can get such flexible things in Haskell if you like; the traditional one is Data.Sequence, but Ed Kmett's been working on some faster ones lately.dfeuer

2 Answers

2
votes

Here are a couple of examples of creating Vectors from files. They are not the most efficient, but both run in just a couple of seconds in ghci.

module Main where

import qualified Data.ByteString.Lazy as BIN
import qualified Data.ByteString as BS
import qualified Data.Vector.Unboxed as Vec
import System.IO
import System.Posix

getFileSize :: String -> IO Int
getFileSize path = do
    stat <- getFileStatus path
    return (fromEnum $ fileSize stat)

readVector1 path = do
  size <- getFileSize path
  withBinaryFile path ReadMode $ \h -> do
    -- can also use: size <- hFileSize h
    let go _ = do bs <- BS.hGet h 2 
                  return (BS.index bs 0, BS.index bs 1)
    Vec.generateM (div size 2) go

pairs (a:b:rest) = (a,b) : pairs rest
pairs _          = []

readVector2 path = do
  contents <- BIN.readFile path
  -- unfoldr :: Unbox a => (b -> Maybe (a, b)) -> b -> Vector a
  let v = Vec.unfoldr go (pairs $ BIN.unpack contents)
        where go [] = Nothing
              go (p:ps) = Just (p, ps)
  return v

main = do
  v <- readVector1 "rand" -- large file
  print $ Vec.length v
  v <- readVector2 "rand"
  print $ Vec.length v

A third alternative:

readVector3 path = do
  contents <- BS.readFile path
  let size = BS.length contents
      v = Vec.generate (div (fromIntegral size) 2) go
            where go i = let a = BS.index contents (2*i)
                             b = BS.index contents (2*i+1)
                         in (a,b)
  return v

This one turns out to be the fastest of the three.

2
votes

Here's an alternative approach for loading the vector, that uses pipes and pipes-bytestring to stream the file, and the vector function from foldl to create the vector:

{-# LANGUAGE PackageImports #-}
import Data.Functor (void)
import "pipes" Pipes
import qualified "pipes" Pipes.Prelude as P
import qualified "pipes-bytestring" Pipes.ByteString as B
import qualified "pipes-binary" Pipes.Binary as B
import qualified "vector" Data.Vector.Unboxed as V
import qualified "foldl" Control.Foldl as L
import "lens-family-core" Lens.Family (view)
import System.IO

main :: IO ()
main = do
    v <- withBinaryFile "somefile" ReadMode (\h ->
        -- for simplicity, errors are ignored with "void"
        L.impurely P.foldM L.vector (void (view B.decoded (B.drop 1 (B.fromHandle h)))))
    print (V.length (v::V.Vector (B.Word8,B.Word8)))

cons is inefficient. The approach taken by foldl's vector is to progressively double the vector's capacity using unsafeGrow, in order to accomodate incoming values, and at the end "trim" any excess capacity with unsafeTake.