I am using Spark 2.3 with Java 1.8.
I have an RDD of CSV records, say:
JavaRDD<CsvRecordsPerApp> csvRecordsRdd
Here each CsvRecordsPerApp has multiple fields:
class CsvRecordsPerApp implements Serializable {
    String customerName;
    String supplierName;
    String otherFieldName;
    // plus getters: getCustomerName(), getSupplierName(), getOtherFieldName()
}
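For completeness, the RDD is built from a CSV file along these lines (just a sketch; jsc, the input path, and the parsing are placeholders, not my real code):

// Sketch: jsc is an existing JavaSparkContext; path and parsing are illustrative.
JavaRDD<CsvRecordsPerApp> csvRecordsRdd = jsc.textFile("input/records.csv")
        .map(line -> {
            String[] cols = line.split(",");
            CsvRecordsPerApp rec = new CsvRecordsPerApp();
            rec.customerName = cols[0];
            rec.supplierName = cols[1];
            rec.otherFieldName = cols[2];
            return rec;
        });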
I want to split it by field and save each field into its own folder, so the three fields end up in three separate folders like:
- customerNames/part-00000
- customerNames/part-00001
...
- supplierNames/part-00000
- supplierNames/part-00001
...
- otherFieldNames/part-00000
- otherFieldNames/part-00001
...
But when I do the following, it writes all three fields together into a single output folder:
JavaRDD<CsvRecordsPerApp> csvRecordsRdd = ...
csvRecordsRdd.saveAsTextFile("file-name");
like:
file-name/part-00000
file-name/part-00001
...
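For context, saveAsTextFile writes each element's toString() on its own line, so all three fields land on the same line in the same folder. A minimal sketch of what I mean, assuming a simple comma-joined toString (the exact format is illustrative, not my real code):

// Assumption: with a toString like this, every output line is
// "customerName,supplierName,otherFieldName" -- all fields in one place.
@Override
public String toString() {
    return customerName + "," + supplierName + "," + otherFieldName;
}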
What I tried is to map csvRecordsRdd into the individual fields and save it three times, like below:
JavaRDD<String> customerNameRdd = csvRecordsRdd.map(csv -> csv.getCustomerName());
customerNameRdd.saveAsTextFile("customerNames");
JavaRDD<String> supplierNameRdd = csvRecordsRdd.map(csv -> csv.getSupplierName());
supplierNameRdd.saveAsTextFile("supplierNames");
JavaRDD<String> otherFieldNameRdd = csvRecordsRdd.map(csv -> csv.getOtherFieldName());
otherFieldNameRdd.saveAsTextFile("otherFieldNames");
The problem here is that this recomputes the RDD 3 times, so all the upstream work runs in triplicate!
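Each saveAsTextFile is a separate action, so each one walks the full lineage again. A quick sketch of how one could observe this with a LongAccumulator (jsc, computeCount, and counted are made-up names for illustration):

// Sketch: count how many times the map function actually runs.
LongAccumulator computeCount = jsc.sc().longAccumulator("computeCount");
JavaRDD<CsvRecordsPerApp> counted = csvRecordsRdd.map(csv -> {
    computeCount.add(1); // one increment per record, per recomputation
    return csv;
});
// After running the three saves on RDDs derived from `counted`,
// computeCount.value() is 3x the record count: the RDD ran 3 times.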
Then, to stop the recomputation, I tried caching as below, but it did not work and the RDD is still computed 3 times:
csvRecordsRdd.persist(StorageLevel.MEMORY_AND_DISK()); // or: csvRecordsRdd.cache();
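For reference, here is roughly how the whole caching attempt is wired up (a sketch using the same names as above). persist() is lazy, so the first saveAsTextFile should materialize the cache and the next two saves should reuse it:

import org.apache.spark.storage.StorageLevel;

// persist() only marks the RDD; the cache is filled by the first action.
csvRecordsRdd.persist(StorageLevel.MEMORY_AND_DISK());

// First action: computes the lineage once and should fill the cache.
csvRecordsRdd.map(csv -> csv.getCustomerName()).saveAsTextFile("customerNames");

// These two should then read from the cache instead of recomputing...
csvRecordsRdd.map(csv -> csv.getSupplierName()).saveAsTextFile("supplierNames");
csvRecordsRdd.map(csv -> csv.getOtherFieldName()).saveAsTextFile("otherFieldNames");

// ...but in my runs the full lineage still appears to execute 3 times.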
I am looking for ideas to solve this problem.