
Suppose I have activity data for every country in the world (relationally speaking, each record has a country column), and I want to enrich it with reference data that is also available per country, then use the result as input to an ML algorithm. Moreover, I know that each joined set (keyed and joined by country) fits in memory on a single executor. What is the most efficient way to process this data at country level and union the results to produce the overall output?

My current thoughts are:

  1. Partition and key both data sets by country, join the two at country level, then use pure Scala code (within a map function) to process the data, converting the partial outputs back to Datasets for a final union.
  2. Partition and key both data sets by country, join the two at country level, stick to Datasets throughout, and let Spark 2 optimize the computation, e.g. benefit from the efficient serialized representation of the data, whole-stage code generation, etc. (a sketch of this option follows below).
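For concreteness, here is a minimal sketch of what I mean by option 2. The case classes, input paths, and column names are placeholders, not my real schema:

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical schemas standing in for the real activity and reference data.
case class Activity(country: String, userId: Long, metric: Double)
case class Reference(country: String, gdp: Double)

val spark = SparkSession.builder().appName("enrich-by-country").getOrCreate()
import spark.implicits._

// Placeholder input paths.
val activity = spark.read.parquet("/data/activity").as[Activity]
val reference = spark.read.parquet("/data/reference").as[Reference]

// Repartition both inputs on the join key so each country's rows are
// colocated, then join on that key and let Catalyst take it from there.
val enriched = activity
  .repartition($"country")
  .join(reference.repartition($"country"), Seq("country"))
```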

Secondly, with option 2, how can I ensure that, after the initial keying and joining by country, all subsequent aggregation operations like groupBy() have narrow dependencies (since all of a country's data will already be in the same partition)? Would I simply need to include the country column in the keys I pass to my aggregation function, e.g. groupBy(country, key1, key2, ...)?
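For example, continuing from the sketch above (userId is a placeholder finer-grained key), I would hope something like the following reuses the existing country partitioning, which I could verify with explain():

```scala
// Since `enriched` is already hash-partitioned by country, grouping on
// (country, finer keys) should let Spark satisfy the required distribution
// without another shuffle.
val perCountry = enriched
  .groupBy($"country", $"userId")
  .count()

perCountry.explain() // check that no additional Exchange appears in the plan
```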


1 Answer


I would do three basic things:

  • Partition both data sets on the country key
  • Use mapPartitions to transform each partition's Iterator of data on a per-partition basis
  • Use reduceByKey or aggregateByKey for the aggregations, so you get map-side combining

All of this should let you be as efficient as possible in terms of minimizing network shuffle and disk I/O, doing all the real work before you perform your write action at the end.
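To make that concrete, here is a rough sketch of the three steps with pair RDDs keyed by country; the value types, the per-record function, and the partition count are placeholders to adapt:

```scala
import org.apache.spark.HashPartitioner
import org.apache.spark.rdd.RDD

def processByCountry(activity: RDD[(String, Double)],
                     reference: RDD[(String, Double)]): RDD[(String, Double)] = {
  val partitioner = new HashPartitioner(200) // tune to your data/cluster

  // 1. Partition both sides on the country key so the join is co-partitioned
  //    (a narrow dependency: no extra shuffle at join time).
  val joined = activity.partitionBy(partitioner)
    .join(reference.partitionBy(partitioner))

  // 2. Per-partition transformation in a single pass over each iterator;
  //    preservesPartitioning keeps the country partitioning intact.
  val transformed = joined.mapPartitions(
    iter => iter.map { case (country, (act, ref)) => (country, act * ref) },
    preservesPartitioning = true)

  // 3. Aggregate with map-side combining; because the partitioner was
  //    preserved, reduceByKey needs no further shuffle.
  transformed.reduceByKey(_ + _)
}
```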