I am working on processing the English Wikipedia dump with DBpedia. So far, their implementation extends Traversable and provides a foreach to go over the dump. However, I would like to have the typical collection operations such as map, grouped, etc. Here is the issue I opened: https://github.com/dbpedia/extraction-framework/issues/140
So I added a getter to receive an iterable and an iterator. Now the interesting part:
source.iterable
.map(parser)
.zipWithIndex
.map { case(page: PageNode, i: Int) =>
if(i%1000 == 0){println(i)}
(...)
}
.grouped(2000)
The code above runs out of memory. However:
source.iterator
.map(parser)
.zipWithIndex
.map { case(page: PageNode, i: Int) =>
if(i%1000 == 0){println(i)}
(...)
}
.grouped(2000)
This code returns immediately as one would expect.
It seems to me that the first example runs through the whole pipeline once and runs out of memory because it tries to store the entire dump in memory. The latter does not. However, the latter returns an iterator over Seq instead of an iterator over iterators.
Is this expected of an Iterable class, or am I doing something wrong? I would expect both to return immediately and to consume memory only once they are iterated.
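To make the difference concrete, here is a minimal sketch with a plain List standing in for the dump. The object name LazinessSketch and the sample data are made up for illustration, and it assumes that source.iterable behaves like a strict Scala collection, which I have not verified against the dbpedia code. The third case uses a view, which as far as I understand defers map on an Iterable in the same way an iterator does.

```scala
object LazinessSketch {
  // Returns how many times each mapping function had run right after its
  // pipeline was built, plus the first group pulled from the iterator pipeline.
  def run(): (Int, Int, Int, Seq[Int]) = {
    var strictCalls = 0
    var iteratorCalls = 0
    var viewCalls = 0

    // Hypothetical stand-in for source.iterable (assumed to be a strict collection).
    val data: Iterable[Int] = List(1, 2, 3, 4, 5)

    // Strict: map visits every element and builds a new collection right away.
    val strict = data.map { x => strictCalls += 1; x * 2 }
    val strictSeen = strictCalls

    // Iterator: map and grouped only wrap the iterator; nothing has run yet.
    val groups = data.iterator.map { x => iteratorCalls += 1; x * 2 }.grouped(2)
    val iteratorSeen = iteratorCalls

    // View: a lazy wrapper around the Iterable; map is deferred here as well.
    val viewed = data.view.map { x => viewCalls += 1; x * 2 }
    val viewSeen = viewCalls

    // Work happens only when a group is requested, and each group is a
    // fully built Seq rather than another iterator.
    val firstGroup: Seq[Int] = groups.next()

    (strictSeen, iteratorSeen, viewSeen, firstGroup)
  }
}
```

Calling LazinessSketch.run() gives the counts 5, 0 and 0, and a first group containing 2 and 4. That matches what I see with the dump: the Iterable version does all the work up front, while the iterator version does nothing until a group is requested.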
Thx for your help! Karsten