I am using Spark 2.3 with Java 1.8.
I have an RDD of CSV records, say:
JavaRDD<CsvRecordsPerApp> csvRecordsRdd
Here each CsvRecordsPerApp has multiple fields:
class CsvRecordsPerApp implements Serializable {
    String customerName;
    String supplierName;
    String otherFieldName;
    // plus getters: getCustomerName(), getSupplierName(), getOtherFieldName()
}
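For completeness, the RDD is built from a CSV file along these lines (just a sketch; jsc, the input path, and the parsing are placeholders, not my real code):

// Sketch: jsc is an existing JavaSparkContext; path and parsing are illustrative.
JavaRDD<CsvRecordsPerApp> csvRecordsRdd = jsc.textFile("input/records.csv")
        .map(line -> {
            String[] cols = line.split(",");
            CsvRecordsPerApp rec = new CsvRecordsPerApp();
            rec.customerName = cols[0];
            rec.supplierName = cols[1];
            rec.otherFieldName = cols[2];
            return rec;
        });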
I want to split it by field and save each field into its own folder, so the three fields end up in three separate folders like:
- customerNames/part-00000
- customerNames/part-00001
...
- supplierNames/part-00000
- supplierNames/part-00001
...
- otherFieldNames/part-00000
- otherFieldNames/part-00001
...
But when I do the following, it writes all three fields together into a single output folder:
JavaRDD<CsvRecordsPerApp> csvRecordsRdd = ...
csvRecordsRdd.saveAsTextFile("file-name");
like:
file-name/part-00000
file-name/part-00001
...
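For context, saveAsTextFile writes each element's toString() on its own line, so all three fields land on the same line in the same folder. A minimal sketch of what I mean, assuming a simple comma-joined toString (the exact format is illustrative, not my real code):

// Assumption: with a toString like this, every output line is
// "customerName,supplierName,otherFieldName" -- all fields in one place.
@Override
public String toString() {
    return customerName + "," + supplierName + "," + otherFieldName;
}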
What I tried is to map csvRecordsRdd into the individual fields and save it three times, like below:
JavaRDD<String> customerNameRdd = csvRecordsRdd.map(csv -> csv.getCustomerName());
customerNameRdd.saveAsTextFile("customerNames");
JavaRDD<String> supplierNameRdd = csvRecordsRdd.map(csv -> csv.getSupplierName());
supplierNameRdd.saveAsTextFile("supplierNames");
JavaRDD<String> otherFieldNameRdd = csvRecordsRdd.map(csv -> csv.getOtherFieldName());
otherFieldNameRdd.saveAsTextFile("otherFieldNames");
The problem here is that this recomputes the RDD 3 times, so all the upstream work runs in triplicate!
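Each saveAsTextFile is a separate action, so each one walks the full lineage again. A quick sketch of how one could observe this with a LongAccumulator (jsc, computeCount, and counted are made-up names for illustration):

// Sketch: count how many times the map function actually runs.
LongAccumulator computeCount = jsc.sc().longAccumulator("computeCount");
JavaRDD<CsvRecordsPerApp> counted = csvRecordsRdd.map(csv -> {
    computeCount.add(1); // one increment per record, per recomputation
    return csv;
});
// After running the three saves on RDDs derived from `counted`,
// computeCount.value() is 3x the record count: the RDD ran 3 times.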
Then, to stop the recomputation, I tried caching as below, but it did not work and the RDD is still computed 3 times:
csvRecordsRdd.persist(StorageLevel.MEMORY_AND_DISK()); // or: csvRecordsRdd.cache();
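For reference, here is roughly how the whole caching attempt is wired up (a sketch using the same names as above). persist() is lazy, so the first saveAsTextFile should materialize the cache and the next two saves should reuse it:

import org.apache.spark.storage.StorageLevel;

// persist() only marks the RDD; the cache is filled by the first action.
csvRecordsRdd.persist(StorageLevel.MEMORY_AND_DISK());

// First action: computes the lineage once and should fill the cache.
csvRecordsRdd.map(csv -> csv.getCustomerName()).saveAsTextFile("customerNames");

// These two should then read from the cache instead of recomputing...
csvRecordsRdd.map(csv -> csv.getSupplierName()).saveAsTextFile("supplierNames");
csvRecordsRdd.map(csv -> csv.getOtherFieldName()).saveAsTextFile("otherFieldNames");

// ...but in my runs the full lineage still appears to execute 3 times.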
I am looking for ideas to solve this problem.