I'm trying to read and analyse a huge CSV file. I used Data.Csv.Streaming from cassava, and functions are applied in the following order:
Data.ByteString.Lazy.readFile -- Gives lazy stream
Data.Csv.Streaming.decodeByname -- Gives Either String (Header Records t)
\(Right (_, v)) -> v -- Gives right side of either (Records t)
Data.Foldable.toList -- Gives [t]
After this the program enters the analysis stage, and executes four (this is very important) different instances (i.e. with different filters) of the following
filter -- Result of toList is applied through a filter
map
Data.Foldable.foldl' -- Does bin counting using a map. The map has at most 60 keys.
However, it appears that the program takes up a huge amount of memory while attempting to load the entire CSV file.
If I only have one instance of foldl' executing, the program does a nice single pass through the CSV data and doesn't consume as much memory. Is there a way to fuse the foldl's together? That is, having
x = foldl' f Map.empty $ filter cx li
y = foldl' f Map.empty $ filter cy li
...
and force it to execute in single pass.
Edit: The following function is used in foldl with Data.Map.Strict as Map:
bincollect :: Ord a => Num b => Map.Map a b -> a -> Map.Map a b
bincollect !m !key = Map.insertWith (+) key 1 m
and the foldl begins with an empty map.
The memory usage grows with the number of elements taked with or without optimization on.
fold'is a strict function. You may want to take a look atfoldrinstead - Welperoonifoldl'. In factfoldl'is the non-strict version offoldl. - Willem Van Onsemfoldl'doesn't need the whole list to be loaded into memory. Replacing it with the regularfoldlstill gives the same memory usage. - Henricus V.foldl', it should be OK.foldl'will need to scan the whole list, but it can be garbage collected while this is done if no one else is using the list. However, be sure that the function you pass tofoldl'is suitably strict, so that you do not end up with a huge thunk. E.g. you final 60-key map could be storing lazy thunks instead of evaluated counts. Perhaps you should post thefoldl'code. - chi