
I'm reading Parquet data stored on the local cluster's HDFS into a Spark DataFrame. The data is split among approximately 96,000 individual files. I know that ideally it wouldn't be split into so many small files, but for now I have to deal with it in this format. I'm using PySpark v2.2.0.

When I run spark.read.parquet(data_root), something strange happens: Spark sequentially spawns a series of jobs, each with about 2000 tasks. It spawns 48 of these jobs in total, each with a single stage. Across those 48 jobs it executes roughly 96,000 tasks, so I assume it runs one task per Parquet file. Each job takes only about 2 seconds to run.

What I find strange is that this doesn't happen as one job with 96,000 tasks, which would presumably be faster (no stage boundaries). Where does the number 2000 come from? Is there a parameter I can tune to force more of these tiny tasks into the same job and speed things up?
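For reference, the whole read is just the call below; the last line is only there to show how many partitions (and therefore scan tasks) the resulting DataFrame ends up with. The path is a placeholder for the real directory.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("read-small-parquet-files").getOrCreate()

data_root = "hdfs:///path/to/parquet"  # placeholder for the real HDFS directory

# This single call is what kicks off the ~48 sequential jobs of ~2000 tasks each.
df = spark.read.parquet(data_root)

# One scan task per partition.
print(df.rdd.getNumPartitions())
```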

You can use parquet-tools to merge all of them together first and then use Spark to read it. – philantrovert
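If parquet-tools isn't at hand, a one-off Spark job can do a similar compaction instead; this is only a sketch (the output path and target file count are made up), reusing spark and data_root from the snippet in the question.

```python
compacted_root = "hdfs:///path/to/parquet_compacted"  # hypothetical output path

# Rewrite the ~96,000 tiny files as a few hundred larger ones, then point
# subsequent reads at the compacted copy.
(spark.read.parquet(data_root)
      .repartition(400)  # aim for roughly HDFS-block-sized output files
      .write
      .mode("overwrite")
      .parquet(compacted_root))
```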

1 Answer


This is a feature introduced in Spark 2.0. FileSourceStrategy packs the smaller Parquet files together into larger read partitions so that the scan can run more efficiently. Ideally, each Spark task works on roughly one HDFS block's worth of data (128 MB).
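If the goal is fewer, larger scan tasks without rewriting the data, the packing in Spark 2.x is governed by spark.sql.files.maxPartitionBytes and spark.sql.files.openCostInBytes. Below is a minimal sketch, assuming the tasks you see come from that file packing; the values are illustrative, not recommendations, and the path is a placeholder.

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("tuned-small-file-read")
    # Max bytes packed into a single read partition (default 128 MB).
    .config("spark.sql.files.maxPartitionBytes", 256 * 1024 * 1024)
    # Estimated cost of opening one file, in bytes (default 4 MB);
    # lowering it lets Spark pack more tiny files into each partition.
    .config("spark.sql.files.openCostInBytes", 1 * 1024 * 1024)
    .getOrCreate()
)

df = spark.read.parquet("hdfs:///path/to/parquet")  # placeholder path

# With the settings above, the scan should use far fewer partitions (tasks)
# than the number of input files.
print(df.rdd.getNumPartitions())
```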