0 votes

I have a dataset that was partitioned by the column ID and written to disk, so each partition gets its own folder in the filesystem. Now I am reading the data back in and would like to call groupBy('ID') followed by a pandas_udf. My question: since the data was partitioned by ID, is groupBy('ID') any faster than it would be if the data hadn't been partitioned? Would it be better to, e.g., read one ID at a time using the folder structure? I worry that the groupBy operation looks through every record even though the data has already been partitioned.
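For concreteness, here is a minimal sketch of the workflow described above (the output path, the pandas function my_pandas_fn, and the result schema out_schema are placeholders; applyInPandas is used as the grouped-map API for the pandas_udf step):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# write: partitionBy creates one folder per ID value, e.g. .../ID=1/, .../ID=2/
df.write.partitionBy("ID").parquet("/path/to/output")

# read back and group; applyInPandas runs a pandas function once per group
df2 = spark.read.parquet("/path/to/output")
result = df2.groupBy("ID").applyInPandas(my_pandas_fn, schema=out_schema)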

group by does look through every record unless you put a where clause. That will prune the partitions and only access the IDs you specified. – Dusan Vasiljevic
@DusanVasiljevic can you add some more detail? How would you use where before groupBy? – cgreen
I mean that after you have the partitioned data, a where that filters on the same partition column will be faster. If you group by over everything, you will still get the same plan and the same number of stages (a shuffle will happen). – Dusan Vasiljevic
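For example, filtering on the partition column before grouping lets Spark prune the folders at read time (a sketch; the path and my_pandas_fn/out_schema are placeholders):

from pyspark.sql.functions import col

# only the ID=42 folder is scanned, so the groupBy sees a single ID
subset = spark.read.parquet("/path/to/output").where(col("ID") == 42)
subset.groupBy("ID").applyInPandas(my_pandas_fn, schema=out_schema)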

1 Answer

0 votes
  1. You have partitioned by ID and saved to disk.
  2. You read it again and want to groupBy and apply a pandas UDF.

It is obvious that the groupBy will look through every record, and so will most operations. But applying a pandas_udf after groupBy("ID") is going to be expensive, because it triggers an unnecessary shuffle.

You can optimize performance by grouping by spark_partition_id() instead, since you have already partitioned the data by the column you want to group on.
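A minimal sketch of that suggestion, assuming each in-memory partition ends up holding exactly one ID after the read (this depends on file sizes and reader settings; my_pandas_fn and out_schema are placeholders):

from pyspark.sql.functions import spark_partition_id

# group on the physical partition id instead of the ID column,
# avoiding a hash-repartitioning shuffle before the pandas UDF runs
df2 = spark.read.parquet("/path/to/output")
result = df2.groupBy(spark_partition_id().alias("pid")).applyInPandas(
    my_pandas_fn, schema=out_schema
)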

EDIT:

If you want file names, you can try:

from pyspark.sql.functions import input_file_name

# add the source file path of each row as a new column
df = df.withColumn("filename", input_file_name())
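Since the data was written with partitionBy("ID"), the returned path includes the ID=... folder segment, so the filename column also tells you which partition each row came from. For example:

df.select("filename").distinct().show(truncate=False)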