I use Spark (especially pyspark) and have the following scenario:
- There are two tables (A and B).
- A is partitioned by columns c and d.
- I want to read all the data from A, do a tiny transformation (e.g. adding one static column) and then save the result to B.
- B should have the exact same partitioning as A.
- Shuffling needs to be avoided since we have 3 billion rows.
So from a logical perspective, it should be very easy to make this happen without any shuffling.
But how?
Version 1
from pyspark.sql import functions as f

df = spark.read.parquet("pathToA")
df = df.withColumn("x", f.lit("x"))
df.write.format("parquet").save("pathToB")
In this case, table B is not partitioned at all.
Version 2
df = spark.read.parquet("pathToA")
df = df.withColumn("x", f.lit("x"))
df.write.format("parquet").partitionBy("c", "d").save("pathToB")
In this case, there is a lot of shuffling going on, and it takes forever.
Version 3
df = spark.read.parquet("pathToA")
df = df.withColumn("x", f.lit("x"))
df = df.repartition("c", "d")
df.write.format("parquet").partitionBy("c", "d").save("pathToB")
Same as Version 2 -> lots of shuffling, and it never finishes.
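One workaround I have been considering, but have not verified end to end, is to copy the partition directories one by one, so that every job is a plain read-transform-write with no repartition or partitionBy and therefore (as far as I understand) no shuffle. The partition values below are made up; in reality they would have to come from the metastore or a file listing:

from pyspark.sql import functions as f

# A uses the usual Hive-style layout, e.g. pathToA/c=1/d=42/part-*.parquet,
# and B should end up with the same tree under pathToB.
partition_values = [("1", "42"), ("2", "42")]  # made-up (c, d) values

for c_val, d_val in partition_values:
    src = "pathToA/c={}/d={}".format(c_val, d_val)
    dst = "pathToB/c={}/d={}".format(c_val, d_val)
    # Reading a single leaf directory and writing it back out is a pure
    # map job; the partition columns stay encoded in the directory names.
    part_df = spark.read.parquet(src)
    part_df = part_df.withColumn("x", f.lit("x"))
    part_df.write.format("parquet").mode("overwrite").save(dst)

Of course this launches one job per (c, d) combination, so I doubt it is practical for a large number of partitions, which is why I am still looking for a better way.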
If anyone has an idea of how to achieve this without shuffling, it would be very helpful! Thanks a lot in advance!
Best regards, Benjamin