2
votes

I'm trying to read and analyse a huge CSV file. I used Data.Csv.Streaming from cassava, and functions are applied in the following order:

Data.ByteString.Lazy.readFile -- Gives lazy stream
Data.Csv.Streaming.decodeByname -- Gives Either String (Header Records t)
\(Right (_, v)) -> v -- Gives right side of either (Records t)
Data.Foldable.toList -- Gives [t]

After this the program enters the analysis stage, and executes four (this is very important) different instances (i.e. with different filters) of the following

filter -- Result of toList is applied through a filter
map
Data.Foldable.foldl' -- Does bin counting using a map. The map has at most 60 keys.

However, it appears that the program takes up a huge amount of memory while attempting to load the entire CSV file.

If I only have one instance of foldl' executing, the program does a nice single pass through the CSV data and doesn't consume as much memory. Is there a way to fuse the foldl's together? That is, having

x = foldl' f Map.empty $ filter cx li
y = foldl' f Map.empty $ filter cy li
...

and force it to execute in single pass.

Edit: The following function is used in foldl with Data.Map.Strict as Map:

bincollect :: Ord a => Num b => Map.Map a b -> a -> Map.Map a b
bincollect !m !key = Map.insertWith (+) key 1 m

and the foldl begins with an empty map.

The memory usage grows with the number of elements taked with or without optimization on.

1
fold' is a strict function. You may want to take a look at foldr instead - Welperooni
This is du to foldl'. In fact foldl' is the non-strict version of foldl. - Willem Van Onsem
@Welperooni foldl' doesn't need the whole list to be loaded into memory. Replacing it with the regular foldl still gives the same memory usage. - Henricus V.
If your list is only passed to foldl', it should be OK. foldl' will need to scan the whole list, but it can be garbage collected while this is done if no one else is using the list. However, be sure that the function you pass to foldl' is suitably strict, so that you do not end up with a huge thunk. E.g. you final 60-key map could be storing lazy thunks instead of evaluated counts. Perhaps you should post the foldl' code. - chi
Am I the only one here thinking that strictness is not to blame? The red flag for me is this sentence: "After this the program executes four different instances of the following...". Unless you are re-reading the file and re-parsing it after each instance, you won't be able to garbage-collect already-processed parts of the file, so of course it will force the whole thing into memory. But to know for sure what's going wrong, we need an MCVE. - Daniel Wagner

1 Answers

2
votes

Yes, you can indeed fuse the four folds together, but you'll have to do it manually. You could try and write out the logic yourself, or you could use a library (like foldl) to help. For instance, you can turn your bincollect into a fold:

bincollect :: (Ord a, Num b) => Fold a (Map.Map a b)
bincollect = Fold (\m key -> Map.insertWith (+) key 1 m) Map.empty id

Then, you can filter using prefilter:

x = prefilter cx bincollect

Finally, you can combine them together using the Applicative instance:

(w,x,y,z) = fold ((,,,) <$> prefilter cw bincollect
                        <*> prefilter cx bincollect
                        <*> prefilter cy bincollect
                        <*> prefilter cz bincollect)
                 input