I have a very big dataframe in Spark 2.3, like this:

-------------------------
| col_key | col1 | col2 |
-------------------------
|      AA |    1 |    2 |
|      AB |    2 |    1 |
|      AA |    2 |    3 |
|      AC |    1 |    2 |
|      AA |    3 |    2 |
|      AC |    5 |    3 |
-------------------------

I need to "split" this dataframe by the values in the col_key column and save each part to a separate CSV file, so I get smaller dataframes like

-------------------------
| col_key | col1 | col2 |
-------------------------
|      AA |    1 |    2 |
|      AA |    2 |    3 |
|      AA |    3 |    2 |
-------------------------

and

-------------------------
| col_key | col1 | col2 |
-------------------------
|      AC |    1 |    2 |
|      AC |    5 |    3 |
-------------------------

and so on. Each resulting dataframe needs to be saved as a separate CSV file.

The number of keys is small (20-30), but the total amount of data is large (~200 million records).

I have a solution where each part of the data is selected in a loop and then saved to a file:

import org.apache.spark.sql.functions.lit
import spark.implicits._   // for the $"col_key" column syntax and the String encoder

// Collect the distinct keys to the driver (cheap here: only 20-30 keys).
val keysList = df.select("col_key").distinct().map(r => r.getString(0)).collect.toList

// Filter the dataframe by each key and save that slice separately.
keysList.foreach { k =>
  val dfi = df.where($"col_key" === lit(k))
  SaveDataByKey(dfi, path_to_save)
}

It works correctly, but the big drawback of this solution is that every selection by key causes a full pass through the whole dataframe, which takes too much time. I think there must be a faster solution, where we pass through the dataframe only once and route every record to the "right" result dataframe (or directly to a separate file) along the way. But I don't know how to do it :) Maybe someone has ideas about it?
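As a partial mitigation (not the single-pass solution I'm after), the repeated scans could at least be served from cache instead of re-reading the source every iteration. A minimal sketch, assuming the cluster has enough memory/disk to hold the cached data:

import org.apache.spark.storage.StorageLevel

// Materialize the dataframe once; every per-key filter below then scans
// the cached data instead of re-reading the original source.
df.persist(StorageLevel.MEMORY_AND_DISK)

keysList.foreach { k =>
  val dfi = df.where($"col_key" === lit(k))
  SaveDataByKey(dfi, path_to_save) // same helper as in the loop above
}

df.unpersist()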

Also, I prefer to use Spark's DataFrame API, because it provides the fastest way of processing the data (so using RDDs is not desirable, if possible).

What does SaveDataByKey do? Is what you want to do simply to save the dataframe into different folders partitioned on the col_key column? - Shaido
I need to save the data (extracted by different keys) to different CSV files. SaveDataByKey does exactly this. - Ihor Konovalenko
I don't think it is a duplicate, because I want to use Spark's DataFrame API only, if possible. - Ihor Konovalenko

1 Answer

You need to partition by the column when writing the CSV output. Spark then saves each key's rows under its own sub-directory, in a single pass over the data.

yourDF
  .write
  .partitionBy("col_key")   // one sub-directory per distinct col_key value
  .csv("/path/to/save")

Why don't you try this?
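This writes everything in a single pass: one sub-directory per key (col_key=AA/, col_key=AC/, ...), each containing the CSV part files for that key. Note that the col_key column itself moves into the directory names and no longer appears inside the files.

If you want exactly one CSV file per key, one common variation (a sketch, adjust to your data) is to repartition by the key column first, so that all rows for a key land in a single task:

import spark.implicits._

yourDF
  .repartition($"col_key")    // all rows of one key go to one task...
  .write
  .partitionBy("col_key")     // ...so each key directory gets a single part file
  .option("header", "true")   // optional: keep column names in each file
  .csv("/path/to/save")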