I'm reading a Spark DataFrame that's stored in Parquet format on the local cluster's HDFS. The Parquet data is split across approximately 96,000 individual files. I know that ideally the data wouldn't be split into so many small files, but for now I have to deal with it in this format. I'm using PySpark v2.2.0.
When I run spark.read.parquet(data_root), something strange happens: Spark sequentially spawns a series of jobs, each with about 2,000 tasks. It spawns 48 of these jobs, each with a single stage. Across the 48 jobs it executes roughly 96,000 tasks in total; I assume it runs one task per Parquet file. Each job takes only about 2 seconds to run.
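For completeness, this is essentially all the code involved; data_root below is just a placeholder for the real HDFS path:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("read-many-parquet").getOrCreate()

# Placeholder for the real HDFS directory holding the ~96,000 parquet files
data_root = "hdfs:///path/to/parquet_dir"

# This single call is what triggers the ~48 jobs of ~2,000 tasks each
df = spark.read.parquet(data_root)
```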
What I find strange is that this doesn't all happen in one job with 96,000 tasks, since that would presumably be faster (no stage boundaries). Where does the number 2,000 come from? Is there a parameter I can tune to pack more of these tiny tasks into the same job and thereby speed things up?
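To make the question concrete, here is a sketch of the kind of setting I'm imagining. I'm not sure these are even the relevant configs; spark.sql.sources.parallelPartitionDiscovery.threshold and spark.sql.sources.parallelPartitionDiscovery.parallelism are just the parallel-file-listing options I've seen mentioned for Spark 2.x, so treat both the values and their relevance here as assumptions:

```python
from pyspark.sql import SparkSession

# Sketch only: whether these settings control the ~2,000-task listing jobs is
# an assumption on my part, not something I've confirmed.
spark = (
    SparkSession.builder
    .appName("read-many-parquet")
    # Above this many input paths, Spark lists files with a distributed job
    # rather than on the driver (default is reportedly 32).
    .config("spark.sql.sources.parallelPartitionDiscovery.threshold", "32")
    # Caps the parallelism used by that distributed file-listing job.
    .config("spark.sql.sources.parallelPartitionDiscovery.parallelism", "10000")
    .getOrCreate()
)
```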
Comment: You could use parquet-tools to merge all of them together first and then use Spark to read the result. – philantrovert
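A Spark-only sketch of the same idea, in case parquet-tools isn't available: compact the small files once with Spark itself and read the compacted copy from then on. The output path and partition count below are placeholders, not anything from the thread:

```python
# Read the ~96,000 small files once, write them back as far fewer part files,
# and point subsequent reads at the compacted directory.
compacted_root = "hdfs:///path/to/parquet_dir_compacted"

(spark.read.parquet(data_root)
      .coalesce(200)                 # collapse ~96,000 part files into ~200
      .write.mode("overwrite")
      .parquet(compacted_root))

df = spark.read.parquet(compacted_root)
```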