I have a dataset that was partitioned by column ID and written to disk, so each partition value gets its own folder in the filesystem. Now I am reading this data back in and would like to call groupBy('ID') followed by a pandas_udf function. My question is: since the data was partitioned by ID, is groupBy('ID') any faster than if it hadn't been partitioned? Would it be better to, e.g., read one ID at a time using the folder structure? I worry the groupBy operation is looking through every record even though the records have already been partitioned.
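For concreteness, here is a minimal sketch of the setup I mean, using the newer applyInPandas API in place of a GROUPED_MAP pandas_udf; the path, the toy data, and the summarize function are hypothetical stand-ins:

    from pyspark.sql import SparkSession
    import pandas as pd

    spark = SparkSession.builder.getOrCreate()

    # Toy data standing in for the real dataset
    df = spark.createDataFrame([(1, 10.0), (1, 12.0), (2, 7.0)], ["ID", "value"])

    # Write partitioned by ID: one folder per value, e.g. /data/out/ID=1/
    df.write.mode("overwrite").partitionBy("ID").parquet("/data/out")

    # Read it back and group-apply a pandas function per ID
    df2 = spark.read.parquet("/data/out")

    def summarize(pdf: pd.DataFrame) -> pd.DataFrame:
        # Hypothetical per-group aggregation: mean value per ID
        return pdf.groupby("ID", as_index=False).agg(mean_value=("value", "mean"))

    result = df2.groupBy("ID").applyInPandas(summarize, schema="ID long, mean_value double")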
1 Answer
- You have partitioned by ID and saved to disk
- You read it again and want to groupBy and apply a pandas UDF

The groupBy will indeed look through every record, and so will most operations. But applying a pandas_udf after groupBy("ID") is going to be expensive, because it goes through an unnecessary shuffle: after a plain read, Spark does not know that the data on disk is already laid out by ID, so it repartitions it anyway.

You can optimize performance by grouping on spark_partition_id() instead, since you have already partitioned by the column you want to group on; after the read, each in-memory partition typically maps to files from a single ID folder. A sketch follows below.
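A minimal sketch of that approach, reusing the hypothetical /data/out path from the question's sketch; the per_partition function is a hypothetical aggregation:

    from pyspark.sql.functions import spark_partition_id
    import pandas as pd

    df2 = spark.read.parquet("/data/out")

    def per_partition(pdf: pd.DataFrame) -> pd.DataFrame:
        # Hypothetical aggregation: count rows per ID within one read partition.
        # Still grouping by ID inside the function, because Spark can pack
        # several small files into one read partition, which may then span IDs.
        return pdf.groupby("ID", as_index=False).agg(n=("value", "size"))

    result = (
        df2.groupBy(spark_partition_id().alias("pid"))
           .applyInPandas(per_partition, schema="ID long, n long")
    )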
EDIT:
If you want file names, you can try:

    from pyspark.sql.functions import input_file_name

    # Adds a column with the source file path of each row
    df = df.withColumn("filename", input_file_name())
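Since each row's file path contains its ID folder (e.g. .../ID=1/part-....parquet), this pairs naturally with spark_partition_id() to check how folders map to read partitions; a hypothetical check against the path used above:

    from pyspark.sql.functions import input_file_name, spark_partition_id

    # Hypothetical check: which source files feed each read partition
    (spark.read.parquet("/data/out")
        .withColumn("pid", spark_partition_id())
        .withColumn("filename", input_file_name())
        .select("pid", "filename")
        .distinct()
        .show(truncate=False))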
groupBy does look through every record unless you put a where clause before it. That will prune the partitions and only access the IDs which you specified – Dusan Vasiljevic

where before groupBy? – cgreen

where will be faster if the filter is on the same partition column. If you are running the groupBy on everything, you will still get the same plan and the same number of stages (the shuffle will happen) – Dusan Vasiljevic
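To illustrate the partition pruning described in the comments above, a minimal sketch using the hypothetical /data/out layout; filtering on the partition column lets Spark skip non-matching ID folders entirely:

    from pyspark.sql.functions import col

    # Only folders like /data/out/ID=1/ are scanned; other IDs are skipped
    one_id = spark.read.parquet("/data/out").where(col("ID") == 1)
    one_id.explain()  # the physical plan should show a PartitionFilters entry for ID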