I'm having a bit of difficulty reconciling the difference (if one exists) between `sqlContext.sql("set spark.sql.shuffle.partitions=n")` and re-partitioning a Spark DataFrame using `df.repartition(n)`.

The Spark documentation indicates that `set spark.sql.shuffle.partitions=n` configures the number of partitions that are used when shuffling data, while `df.repartition` seems to return a new DataFrame partitioned by the key(s) specified.

To make this question clearer, here is a toy example of how I believe `df.repartition` and `spark.sql.shuffle.partitions` work:
Let's say we have a DataFrame, like so:
ID | Val
--------
A | 1
A | 2
A | 5
A | 7
B | 9
B | 3
C | 2
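For reference, here is a minimal sketch of how this toy DataFrame could be built (assuming a Spark shell where `spark`/`sqlContext` and the implicits are already available):

```scala
import spark.implicits._  // for .toDF and the $"..." column syntax

// Build the toy DataFrame shown above.
val df = Seq(
  ("A", 1), ("A", 2), ("A", 5), ("A", 7),
  ("B", 9), ("B", 3),
  ("C", 2)
).toDF("ID", "Val")
```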
- Scenario 1: 3 shuffle partitions, repartition DF by ID:
If I were to set `sqlContext.sql("set spark.sql.shuffle.partitions=3")` and then call `df.repartition($"ID")`, I would expect my data to be repartitioned into 3 partitions, with one partition holding all 4 rows with ID "A", another holding the 2 rows with ID "B", and the final partition holding the 1 row with ID "C".
- Scenario 2: 5 shuffle partitions, repartition DF by ID:
In this scenario, I would still expect each partition to ONLY hold data tagged with the same ID. That is to say, there would be NO mixing of rows with different IDs within the same partition (see the sketch below for how I would check this).
Is my understanding off base here? In general, my questions are:
- I am trying to optimize the partitioning of my DataFrame so as to avoid skew, while having each partition hold as much of the same key's data as possible. How do I achieve that with `set spark.sql.shuffle.partitions` and `df.repartition` (for instance, with the variants sketched below)?
- Is there a link between `set spark.sql.shuffle.partitions` and `df.repartition`? If so, what is that link?
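For concreteness, these are the knobs I'm comparing; the partition counts are arbitrary examples, not values from my real job:

```scala
// Global setting: the number of partitions used for shuffles
// (joins, aggregations, repartition-by-expression).
sqlContext.sql("set spark.sql.shuffle.partitions=5")

// Explicit repartition calls on the DataFrame itself.
val byCount  = df.repartition(5)         // shuffle into exactly 5 partitions
val byColumn = df.repartition($"ID")     // partition by the ID column
val both     = df.repartition(5, $"ID")  // partition by ID into exactly 5 partitions
```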
Thanks!