My Spark application is as follow :
1) execute large query with Spark SQL into the dataframe "dataDF"
2) foreach partition involved in "dataDF" :
2.1) get the associated "filtered" dataframe, in order to have only the partition associated data
2.2) do specific work with that "filtered" dataframe and write output
The code is as follow :
val dataSQL = spark.sql("SELECT ...")
val dataDF = dataSQL.repartition($"partition")
for {
row <- dataDF.dropDuplicates("partition").collect
} yield {
val partition_str : String = row.getAs[String](0)
val filtered = dataDF.filter($"partition" .equalTo( lit( partition_str ) ) )
// ... on each partition, do work depending on the partition, and write result on HDFS
// Example :
if( partition_str == "category_A" ){
// do group by, do pivot, do mean, ...
val x = filtered
.groupBy("column1","column2")
...
// write final DF
x.write.parquet("some/path")
} else if( partition_str == "category_B" ) {
// select specific field and apply calculation on it
val y = filtered.select(...)
// write final DF
x.write.parquet("some/path")
} else if ( ... ) {
// other kind of calculation
// write results
} else {
// other kind of calculation
// write results
}
}
Such algorithm works successfully. The Spark SQL query is fully distributed. However the particular work done on each resulting partition is done sequentially, and the result is inneficient especially because each write related to a partition is done sequentially.
In such case, what are the ways to replace the "for yield" by something in parallel/async ?
Thanks
maporflatMapfunction argument. - Aleksey Isachenkov