2
votes

Okay, I am working on processing the English Wikipedia dump with DBpedia. So far, their implementation extends Traversable and provides a foreach to go over the dump. However, I would like to have the typical collection operations such as map, grouped, etc. Here is the issue I opened: https://github.com/dbpedia/extraction-framework/issues/140

So I added two getters, one that returns an Iterable and one that returns an Iterator. Now the interesting part:

source.iterable
      .map(parser)
      .zipWithIndex
      .map { case(page: PageNode, i: Int) =>
                 if(i%1000 == 0){println(i)}
                 (...)
            }
      .grouped(2000)

The code above runs out of memory. However:

source.iterator
      .map(parser)
      .zipWithIndex
      .map { case(page: PageNode, i: Int) =>
                 if(i%1000 == 0){println(i)}
                 (...)
            }
      .grouped(2000)

This code returns immediately as one would expect.

It seems to me that the first example runs through the whole pipeline once and runs out of memory because it tries to store the dump in memory. The latter does not. However, the latter returns an iterator over Seq instead of an iterator over iterators.

Is this expected of an Iterable class, or am I doing something wrong? I would expect both to return immediately and consume memory only once they are iterated.

Thx for your help! Karsten

2
You may find this post useful. silverbeak

2 Answers

3
votes

By default, all collections in Scala (except streams and views) are strict, so each transformation in a chain like:

pages
  .map(parser)
  .zipWithIndex
  .map { partialFunction }

will return a new collection. You can avoid some of the intermediate results by using a view and then forcing it back to your collection type:

pages.view
  .map(parser)
  .zipWithIndex
  .map { partialFunction }
  .force

For more details, see http://www.scala-lang.org/docu/files/collections-api/collections_42.html
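
To make the difference visible, here is a minimal sketch that counts how often the mapping function runs. The `StrictVsView` object, the small `List` standing in for the dump, and the doubling function standing in for `parser` are all made up for illustration; `.toList` is used in place of `.force` only so the sketch behaves the same on older and newer Scala versions.

```scala
object StrictVsView {
  // Strict collection: map runs the function on every element immediately.
  def strictCalls(): Int = {
    var calls = 0
    val pages = List(1, 2, 3, 4, 5)                    // hypothetical stand-in for the dump
    val mapped = pages.map { p => calls += 1; p * 2 }  // builds a full new List right here
    calls
  }

  // View: map only records the transformation; work happens when elements are consumed.
  // Returns (calls right after map, calls after consuming two elements).
  def viewCalls(): (Int, Int) = {
    var calls = 0
    val pages = List(1, 2, 3, 4, 5)
    val mapped = pages.view.map { p => calls += 1; p * 2 }
    val callsAfterMap = calls
    val firstTwo = mapped.take(2).toList               // consuming forces evaluation
    (callsAfterMap, calls)
  }
}
```

With a strict `List`, the counter already equals the collection size once `map` returns; with the view it stays at zero until something consumes the elements. Note that forcing a view over the whole dump still materializes the final result in memory, so this only removes the intermediate collections.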

0
votes

Calling iterable returns an Iterable, and that just means a collection that has an iterator method. So:

  • source.iterable returns an iterable collection, which may or may not be kept entirely in memory
  • but then map, zipWithIndex, and map each build a complete intermediate collection (assuming a strict Iterable) before grouped is even reached

Calling iterator, on the other hand:

  • source.iterator returns an Iterator over something which may or may not be entirely in memory
  • then map, zipWithIndex, map and grouped won't create intermediate collections (they create new iterators)

It seems to me that this explains why the first example can more easily run out of memory.
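
As a rough illustration of the iterator case, the sketch below chains the same operations on an endless `Iterator` and counts how many elements get "parsed". `LazyBatches`, the `Iterator.from(0)` source, and the multiply-by-ten function are placeholders I made up, not part of the DBpedia code.

```scala
object LazyBatches {
  // Returns (elements parsed before any batch is requested,
  //          elements parsed after the first batch,
  //          contents of the first batch).
  def demo(): (Int, Int, List[(Int, Int)]) = {
    var parsed = 0
    // Hypothetical stand-in for source.iterator: an endless supply of "pages".
    val source: Iterator[Int] = Iterator.from(0)

    val batches = source
      .map { p => parsed += 1; p * 10 }  // stand-in for parser
      .zipWithIndex
      .grouped(3)                        // the question uses 2000

    val parsedBeforeFirstBatch = parsed  // building the chain parses nothing
    val first = batches.next()           // pulls just enough elements for one batch
    (parsedBeforeFirstBatch, parsed, first.toList)
  }
}
```

Building the chain on an endless source returns right away, which would not be possible if any step were strict. Each batch from `grouped` is a materialized `Seq`, which is why the result is an iterator over `Seq` rather than over iterators, but only one batch needs to be held in memory at a time.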