I have a DataFrame with the following schema:
root
|-- id_1: long (nullable = true)
|-- id_2: long (nullable = true)
|-- score: double (nullable = true)
The data looks like this:
+----+----+------------------+
|id_1|id_2|score |
+----+----+------------------+
|0 |9 |0.5888888888888889|
|0 |1 |0.6166666666666667|
|0 |2 |0.496996996996997 |
|1 |9 |0.6222222222222221|
|1 |6 |0.9082996632996633|
|1 |5 |0.5927450980392157|
|2 |3 |0.665774107440774 |
|3 |8 |0.6872367465504721|
|3 |8 |0.6872367465504721|
|5 |6 |0.5365909090909091|
+----+----+------------------+
The goal is to find, for each id_1, the id_2 with the max score. Maybe I'm wrong, but it seems I just need to create a paired RDD:
root
|-- _1: long (nullable = true)
|-- _2: struct (nullable = true)
| |-- _1: long (nullable = true)
| |-- _2: double (nullable = true)
+---+----------------------+
|_1 |_2 |
+---+----------------------+
|0 |[9,0.5888888888888889]|
|0 |[1,0.6166666666666667]|
|0 |[2,0.496996996996997] |
|1 |[9,0.6222222222222221]|
|1 |[6,0.9082996632996633]|
|1 |[5,0.5927450980392157]|
|2 |[3,0.665774107440774] |
|3 |[8,0.6872367465504721]|
|3 |[8,0.6872367465504721]|
|5 |[6,0.5365909090909091]|
+---+----------------------+
and reduce by key, keeping the max. Something like:
paired_rdd.reduceByKey(lambda x1, x2: max(x1, x2, key=lambda x: x[-1]))
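For reference, a minimal sketch of how I build that paired RDD from the DataFrame and reduce it (assuming the DataFrame from above is named original_df, as later in this question):
# Map DataFrame rows to (id_1, (id_2, score)) pairs,
# then keep the pair with the highest score per key
paired_rdd = original_df.rdd.map(lambda row: (row.id_1, (row.id_2, row.score)))
best = paired_rdd.reduceByKey(lambda x1, x2: max(x1, x2, key=lambda x: x[-1]))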
Or the same with the DataFrame API (without a paired RDD):
original_df.groupBy('id_1').max('score')
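As far as I understand, groupBy('id_1').max('score') by itself returns only the max score and drops id_2, so to keep id_2 I would need something like a window function. A sketch of that variant (the names w, rn, and result are mine, for illustration only):
from pyspark.sql import Window
from pyspark.sql import functions as F

# Rank rows within each id_1 by descending score and keep only the top row,
# so the winning id_2 survives
w = Window.partitionBy('id_1').orderBy(F.desc('score'))
result = (original_df
          .withColumn('rn', F.row_number().over(w))
          .where(F.col('rn') == 1)
          .drop('rn'))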
I have two questions, and I would appreciate it if someone could point out my wrong steps.
For 1 billion or even 100 billion records: what are the best practices to achieve the goal (finding, for each id_1, the id_2 with the max score)? I've tried with 50 million and 100 million records and had better results when I shuffled the data first (which is the opposite of what Holden Karau advises). I repartitioned by id_1 with .repartition(X, "id_1"), then ran reduceByKey, and it was faster (see the sketch below). Why?
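Concretely, a sketch of that repartition-then-reduce variant (X stays a placeholder for the partition count I tuned by hand):
# Shuffle by id_1 first so all rows for a key land in one partition,
# then build pairs and reduce per key
repartitioned = original_df.repartition(X, "id_1")  # X: placeholder partition count
best = (repartitioned.rdd
        .map(lambda row: (row.id_1, (row.id_2, row.score)))
        .reduceByKey(lambda x1, x2: max(x1, x2, key=lambda x: x[-1])))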
Why was the DataFrame API several times slower than the RDD API? Where am I wrong?
Thanks.
Comments:
Is original_df partitioned? – Raphael Roth
groupBy of the dataframe API will not make proper use of your cluster resources (because you don't have enough parallelism) – Raphael Roth