We are using spark for file processing. We are processing pretty big files with each file around 30 GB with about 40-50 million lines. These files are formatted. We load them into data frame. Initial requirement was to identify records matching criteria and load them to MySQL. We were able to do that.
Requirement changed recently. Records not meeting criteria are now to be stored in an alternate DB. This is causing issue as the size of collection is too big. We are trying to collect each partition independently and merge into a list as suggested here
We are not familiar with scala, so we are having trouble converting this to Java. How can we iterate over partitions one by one and collect?
Thanks