I have a very big dataframe in Spark 2.3, like this:

-------------------------
| col_key | col1 | col2 |
-------------------------
|      AA |    1 |    2 |
|      AB |    2 |    1 |
|      AA |    2 |    3 |
|      AC |    1 |    2 |
|      AA |    3 |    2 |
|      AC |    5 |    3 |
-------------------------

I need to "split" this dataframe by the values in the col_key column and save each part to a separate CSV file, so I get smaller dataframes like

-------------------------
| col_key | col1 | col2 |
-------------------------
|      AA |    1 |    2 |
|      AA |    2 |    3 |
|      AA |    3 |    2 |
-------------------------

and

-------------------------
| col_key | col1 | col2 |
-------------------------
|      AC |    1 |    2 |
|      AC |    5 |    3 |
-------------------------

and so on. Each resulting dataframe needs to be saved as a separate CSV file.

The number of keys is small (20-30), but the total amount of data is large (~200 million records).

I have a solution where each part of the data is selected in a loop and then saved to a file:

import org.apache.spark.sql.functions.lit
import spark.implicits._   // for the $"col_key" column syntax and the String encoder

// Collect the distinct keys to the driver (cheap here: only 20-30 keys).
val keysList = df.select("col_key").distinct().map(r => r.getString(0)).collect.toList

// Filter the dataframe by each key and save that slice separately.
keysList.foreach { k =>
  val dfi = df.where($"col_key" === lit(k))
  SaveDataByKey(dfi, path_to_save)
}

It works correctly, but the big drawback of this solution is that every selection by key causes a full pass through the whole dataframe, which takes too much time. I think there must be a faster solution, where we pass through the dataframe only once and route every record to the "right" result dataframe (or directly to a separate file) along the way. But I don't know how to do it :) Maybe someone has ideas about it?
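As a partial mitigation (not the single-pass solution I'm after), the repeated scans could at least be served from cache instead of re-reading the source every iteration. A minimal sketch, assuming the cluster has enough memory/disk to hold the cached data:

import org.apache.spark.storage.StorageLevel

// Materialize the dataframe once; every per-key filter below then scans
// the cached data instead of re-reading the original source.
df.persist(StorageLevel.MEMORY_AND_DISK)

keysList.foreach { k =>
  val dfi = df.where($"col_key" === lit(k))
  SaveDataByKey(dfi, path_to_save) // same helper as in the loop above
}

df.unpersist()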

Also, I prefer to use Spark's DataFrame API, because it provides the fastest way of processing the data (so using RDDs is not desirable, if possible).

What does SaveDataByKey do? Is what you want to do simply to save the dataframe into different folders partitioned on the col_key column? - Shaido
I need to save the data (extracted by different keys) to different CSV files. SaveDataByKey does exactly this. - Ihor Konovalenko
I don't think it is a duplicate, because I want to use Spark's DataFrame API only, if possible. - Ihor Konovalenko

1 Answer

You need to partition by the column when writing the CSV output. Spark then saves each key's rows under its own sub-directory, in a single pass over the data.

yourDF
  .write
  .partitionBy("col_key")   // one sub-directory per distinct col_key value
  .csv("/path/to/save")

Why don't you try this?
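This writes everything in a single pass: one sub-directory per key (col_key=AA/, col_key=AC/, ...), each containing the CSV part files for that key. Note that the col_key column itself moves into the directory names and no longer appears inside the files.

If you want exactly one CSV file per key, one common variation (a sketch, adjust to your data) is to repartition by the key column first, so that all rows for a key land in a single task:

import spark.implicits._

yourDF
  .repartition($"col_key")    // all rows of one key go to one task...
  .write
  .partitionBy("col_key")     // ...so each key directory gets a single part file
  .option("header", "true")   // optional: keep column names in each file
  .csv("/path/to/save")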