
I'm having a bit of difficulty reconciling the difference (if one exists) between sqlContext.sql("set spark.sql.shuffle.partitions=n") and re-partitioning a Spark DataFrame utilizing df.repartition(n).

The Spark documentation indicates that set spark.sql.shuffle.partitions=n configures the number of partitions that are used when shuffling data, while df.repartition seems to return a new DataFrame partitioned by the key(s) specified.

To make this question clearer, here is a toy example of how I believe df.repartition and spark.sql.shuffle.partitions work (a runnable sketch of this setup follows the two scenarios below):

Let's say we have a DataFrame, like so:

ID | Val
--------
A  |  1
A  |  2
A  |  5
A  |  7
B  |  9
B  |  3
C  |  2
  1. Scenario 1: 3 shuffle partitions, repartition DF by ID: If I were to set sqlContext.sql("set spark.sql.shuffle.partitions=3") and then call df.repartition($"ID"), I would expect my data to be repartitioned into 3 partitions, with one partition holding all the rows with ID "A", another holding all the rows with ID "B", and the final partition holding the row with ID "C".
  2. Scenario 2: 5 shuffle partitions, repartition DF by ID: In this scenario, I would still expect each partition to ONLY hold data tagged with the same ID. That is to say, there would be NO mixing of rows with different IDs within the same partition.
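
For concreteness, here is a minimal sketch of the setup above. It assumes a SparkSession available as spark (with an older sqlContext, the equivalent calls work the same way):

```scala
// Build the toy DataFrame from the question.
import spark.implicits._

val df = Seq(
  ("A", 1), ("A", 2), ("A", 5), ("A", 7),
  ("B", 9), ("B", 3),
  ("C", 2)
).toDF("ID", "Val")

spark.sql("set spark.sql.shuffle.partitions=3")

// Repartition by ID and print which partition each row ends up in.
df.repartition($"ID")
  .rdd
  .mapPartitionsWithIndex { (idx, rows) => rows.map(r => s"partition $idx: $r") }
  .collect()
  .foreach(println)
```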

Is my understanding off base here? In general, my questions are:

  1. I am trying to optimize the partitioning of my DataFrame so as to avoid skew, while having each partition hold as much data for a single key as possible. How do I achieve that with set spark.sql.shuffle.partitions and df.repartition?

  2. Is there a link between set spark.sql.shuffle.partitions and df.repartition? If so, what is that link?

Thanks!


1 Answer


I would expect my data to be repartitioned into 3 partitions, with one partition holding all the rows with ID "A", another holding all the rows with ID "B", and the final partition holding the row with ID "C".

No

5 shuffle partitions, repartition DF by ID: In this scenario, I would still expect each partition to ONLY hold data tagged with the same ID. That is to say, there would be NO mixing of rows with different IDs within the same partition.

and no.

This is not how partitioning works. Partitioners map values to partitions, but the mapping is in general not one-to-one: distinct keys can hash to the same partition, so a single partition can hold rows with different IDs, and some partitions can end up empty (you can check How does HashPartitioner work? for a detailed explanation).
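
As a rough illustration, the assignment works like this (this mirrors the RDD-level HashPartitioner described in the linked question; DataFrame repartitioning uses a similar hash-then-modulo scheme, just with a different hash function):

```scala
// Sketch of hash partitioning: key -> hashCode modulo numPartitions,
// adjusted to be non-negative. Not Spark's exact internals.
def partitionFor(key: Any, numPartitions: Int): Int = {
  val raw = key.hashCode % numPartitions
  if (raw < 0) raw + numPartitions else raw
}

// With 2 partitions, "A" and "C" collide, so one partition holds
// rows with two different IDs.
Seq("A", "B", "C").foreach { id =>
  println(s"""ID "$id" -> partition ${partitionFor(id, 2)}""")
}
// ID "A" -> partition 1
// ID "B" -> partition 0
// ID "C" -> partition 1
```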

Is there a link between set spark.sql.shuffle.partitions and df.repartition? If so, what is that link?

Indeed there is. If you call df.repartition with partitioning expressions but don't provide a number of partitions, then spark.sql.shuffle.partitions is used as the target partition count.
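
A quick sketch, reusing the df from the question (the partition counts here are just the configured values):

```scala
// Both calls shuffle by ID; they differ only in where the target
// partition count comes from.
spark.sql("set spark.sql.shuffle.partitions=5")

val byConfig   = df.repartition($"ID")     // uses spark.sql.shuffle.partitions -> 5
val byArgument = df.repartition(8, $"ID")  // explicit count -> 8

println(byConfig.rdd.getNumPartitions)     // 5
println(byArgument.rdd.getNumPartitions)   // 8
```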