Scala - Laziness of Iterator and Iterable - Memory consumption

Question

I am processing the English Wikipedia dump with DBpedia. So far their implementation extends Traversable and provides a foreach to go over the dump. However, I would like to have the typical collection operations such as map, grouped, etc. Here is the issue I opened: https://github.com/dbpedia/extraction-framework/issues/140

So I added getters that return an Iterable and an Iterator. Now the interesting part:

source.iterable
      .map(parser)
      .zipWithIndex
      .map { case (page: PageNode, i: Int) =>
                 if(i%1000 == 0){println(i)}
                 (...)
            }
      .grouped(2000)

The code above runs out of memory. However:

source.iterator
      .map(parser)
      .zipWithIndex
      .map { case (page: PageNode, i: Int) =>
                 if(i%1000 == 0){println(i)}
                 (...)
            }
      .grouped(2000)

This code returns immediately, as one would expect.

It seems to me that the first example runs through the whole dump once and runs out of memory because it tries to store the dump in memory. The latter does not. However, the latter returns an iterator over Seq instead of an iterator over iterators.

Is this expected of an Iterable class, or am I doing something wrong? I would expect both to return immediately and to consume memory only once they are iterated.

Thx for your help! Karsten

iuriisusuk · Accepted Answer · 2013-12-13T11:59:42

By default, all collections in Scala (except Stream and views) are strict, so each function applied to a collection:

pages
  .map(parser)
  .zipWithIndex
  .map { partialFunction }

will return a new collection. You can avoid some of the intermediate results by using a view and then forcing it back to your collection type:

pages.view
  .map(parser)
  .zipWithIndex
  .map { partialFunction }
  .force

For more details, see http://www.scala-lang.org/docu/files/collections-api/collections_42.html
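To make the difference visible, here is a minimal sketch that counts how many elements actually get parsed in each variant. The names `LazinessDemo`, `parse` and `pages` are made up for illustration (a small List stands in for the dump, and `parse` is a stand-in for the real parser); it assumes nothing beyond the standard library.

```scala
object LazinessDemo {
  // Number of elements that have gone through `parse` since the last reset.
  var parsed = 0

  // Hypothetical stand-in for the real parser: it only records that it was called.
  def parse(s: String): String = { parsed += 1; s.toUpperCase }

  // Tiny stand-in for the dump.
  val pages: List[String] = List("a", "b", "c", "d", "e")

  // Strict collection: map runs over every element before zipWithIndex starts.
  def strictCount(): Int = {
    parsed = 0
    val result = pages.map(parse).zipWithIndex
    parsed
  }

  // View: the pipeline is only described; nothing is parsed until it is traversed.
  def viewCount(): Int = {
    parsed = 0
    val result = pages.view.map(parse).zipWithIndex
    parsed
  }

  // Iterator: elements are parsed on demand, here only those needed for the first group.
  def iteratorCount(): Int = {
    parsed = 0
    val groups = pages.iterator.map(parse).zipWithIndex.grouped(2)
    val first = groups.next()
    parsed
  }

  def main(args: Array[String]): Unit = {
    println(s"strict:   ${strictCount()} of ${pages.size} parsed")
    println(s"view:     ${viewCount()} of ${pages.size} parsed")
    println(s"iterator: ${iteratorCount()} of ${pages.size} parsed after the first group")
  }
}
```

Note that forcing the view (or calling toList on it) still runs the parser over every element and builds the complete result, so for an input the size of a full dump the iterator variant is probably the one that keeps memory bounded.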
