I have a Spark SQL job that reads my S3 JSON file(s) into a DataFrame.
I then run two SQL queries against that DataFrame, and I found that Spark SQL reads the S3 JSON files again before executing each of them.
If the DataFrame object is not being reused, this will be very costly.
Any help is appreciated.
Here is my code fragment:
    protected boolean doAggregations() throws IOException {
        SQLContext sqlContext = getSQLContext();
        DataFrame edgeDataFrame = sqlContext.read().json(sourceDataDirectory);
        edgeDataFrame.cache();
        getLogger().info("Registering and caching the table 'edgeData'");
        edgeDataFrame.registerTempTable("edgeData");
        String dateKey = DateTimeUtility.SECOND_FORMATTER.print(System.currentTimeMillis());
        for (Map.Entry<String, AggregationMetadata> entry : aggMetadataMap.entrySet()) {
            String aggName = entry.getKey();
            String resultDir = getAggregationResultDirectory(aggName, dateKey);
            String sql = entry.getValue().getSql();
            // The input file(s) are being read again and again instead of operating on the "edgeDataFrame"
            DataFrame dataFrame = sqlContext.sql(sql);
            dataFrame.write().format("json").save(resultDir);
        }
        return true;
    }