In my application, when taking performance numbers, groupBy is eating up a lot of time.
My RDD has the below structure:
JavaPairRDD<CustomTuple, Map<String, Double>>
CustomTuple: This object contains information about the current row in RDD like which week, month, city, etc.
public class CustomTuple implements Serializable {
    private Map hierarchyMap = null;
    private Map granularMap = null;
    private String timePeriod = null;
    private String sourceKey = null;
}
Map
This map contains the statistical data for that row, e.g. how much investment, how many GRPs, etc.:
<"Inv", 20>
<"GRP", 30>
I was executing the below DAG on this RDD:
- apply a filter on one RDD to scope out relevant rows: Filter
- apply a filter on the other RDD to scope out relevant rows: Filter
- Join the RDDs: Join
- apply a map phase to compute investment: Map
- apply a GroupBy phase to group the data according to the desired view: GroupBy
- apply a map phase to aggregate the data as per the grouping achieved in the step above (say, view data across time periods) and also create new objects based on the result set desired to be collected: Map
- collect the result: Collect
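For reference, the GroupBy-and-aggregate part of the hot path looks roughly like the below (a minimal sketch only; "investments" stands in for the output of the investment map, keyed by the desired view, and the summing logic is illustrative):

import java.util.List;
import org.apache.spark.api.java.JavaPairRDD;
import scala.Tuple2;

public class GroupByView {
    // groupByKey ships every value across the network before the
    // follow-up map can aggregate, which is why it dominates the DAG.
    static List<Tuple2<String, Double>> viewByTimePeriod(
            JavaPairRDD<String, Double> investments) {
        return investments
                .groupByKey()              // the GroupBy step (the hot spot)
                .mapValues(values -> {     // the aggregating Map step
                    double sum = 0.0;
                    for (Double v : values) sum += v;
                    return sum;
                })
                .collect();                // the Collect step
    }
}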
So if the user wants to view investment across time periods, then the List below is returned (this was achieved in step 4 above):
<timeperiod1, value>
When I checked the time taken by each operation, GroupBy was taking 90% of the time taken to execute the whole DAG.
IMO, we can replace the GroupBy and subsequent Map operations with a single reduce. But reduce will work on objects of the JavaPairRDD's element type, so my reduce will be like T reduce(T, T), where T will be Tuple2<CustomTuple, Map<String, Double>>.
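What I am considering is reduceByKey, whose function only sees two values at a time and can combine map-side before the shuffle. A rough sketch of merging the metric maps per key (the merge logic is illustrative; this also assumes CustomTuple implements equals/hashCode, which any keyed operation needs):

import java.util.HashMap;
import java.util.Map;
import org.apache.spark.api.java.JavaPairRDD;

public class ReduceByKeyMerge {
    // Replace GroupBy + Map with a single reduceByKey: the function
    // merges two metric maps, summing values for matching metric names.
    static JavaPairRDD<CustomTuple, Map<String, Double>> aggregate(
            JavaPairRDD<CustomTuple, Map<String, Double>> rows) {
        return rows.reduceByKey((a, b) -> {
            Map<String, Double> merged = new HashMap<>(a);
            b.forEach((metric, value) -> merged.merge(metric, value, Double::sum));
            return merged;
        });
    }
}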
Or maybe, after step 3 in the above DAG, I could run another map function that returns me an RDD of <key, value> pairs for just the metric that needs to be aggregated, and then run a reduce.
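Something like the below is what I mean (a sketch only; getTimePeriod() is a hypothetical getter on CustomTuple, and "Inv" is the metric key from the example map above):

import java.util.Map;
import org.apache.spark.api.java.JavaPairRDD;
import scala.Tuple2;

public class MetricProjection {
    // After the join (step 3), project each row down to <view key, metric
    // value> so the shuffle moves one double per row instead of a whole
    // Map, then reduce per key.
    static JavaPairRDD<String, Double> investmentByTimePeriod(
            JavaPairRDD<CustomTuple, Map<String, Double>> joined) {
        return joined
                .mapToPair(row -> new Tuple2<>(
                        row._1().getTimePeriod(),            // hypothetical getter
                        row._2().getOrDefault("Inv", 0.0)))
                .reduceByKey(Double::sum);
    }
}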
Also, I am not sure how the aggregate function works and whether it would be able to help me in this case.
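From what I can tell, aggregate on a plain RDD folds the whole dataset down to a single value, so it would not give a per-view result; its per-key cousin aggregateByKey might, since it takes a zero value, a function folding one value into an accumulator, and a function merging accumulators. A minimal sketch, just summing doubles:

import org.apache.spark.api.java.JavaPairRDD;

public class AggregateSketch {
    // aggregateByKey lets the accumulator type differ from the value
    // type; here both are Double for simplicity.
    static JavaPairRDD<String, Double> totals(JavaPairRDD<String, Double> metric) {
        return metric.aggregateByKey(
                0.0,                  // per-key zero value
                (acc, v) -> acc + v,  // fold a value into a partition-local accumulator
                (a, b) -> a + b);     // merge accumulators across partitions
    }
}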
Secondly, my application will receive requests on varying keys. With my current RDD design, each request requires me to repartition or re-group my RDD on that key. This means that for each request, grouping/re-partitioning takes 95% of the time to compute the job. For example, if a request comes keyed on market, the returned List would be:
<"market1", 20>
<"market2", 30>
This is very discouraging, as the current performance of the application without Spark is 10 times better than the performance with Spark.
Any insight is appreciated.
[EDIT] We also noticed that JOIN was taking a lot of time. Maybe that's why GroupBy was taking time. [EDIT]
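If the JOIN is the real culprit, maybe pre-partitioning both sides with the same partitioner (and caching them) would let the join avoid a full shuffle; a sketch under that assumption (the partition count is again a placeholder):

import org.apache.spark.HashPartitioner;
import org.apache.spark.api.java.JavaPairRDD;
import scala.Tuple2;

public class CoPartitionedJoin {
    // Co-partitioned inputs can be joined without re-shuffling both
    // sides; again assumes CustomTuple has proper equals/hashCode.
    static <V, W> JavaPairRDD<CustomTuple, Tuple2<V, W>> join(
            JavaPairRDD<CustomTuple, V> left, JavaPairRDD<CustomTuple, W> right) {
        HashPartitioner p = new HashPartitioner(128);
        return left.partitionBy(p).join(right.partitionBy(p));
    }
}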
TIA!