I'm reading in a Spark DataFrame that's stored in Parquet format on the local cluster's HDFS. The Parquet data is split among approximately 96,000 individual files. I know that ideally the data wouldn't be split into so many small files, but for now I have to deal with it in this format. I'm using PySpark v2.2.0.
When I run spark.read.parquet(data_root), something strange happens: Spark sequentially spawns a series of jobs, each with about 2000 tasks. It spawns 48 of these jobs, each with one stage. Across these 48 jobs it executes roughly 96,000 tasks, so I assume it runs one task per Parquet file. Each job takes only about 2 seconds to run.
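To make the numbers above concrete, here is a rough sketch of how I time the read and check the task arithmetic. The helper names `timed_parquet_read` and `jobs_for` are made up for illustration and are not part of any Spark API; `spark` is assumed to be an existing SparkSession and `data_root` the HDFS path to the Parquet directory.

```python
import time


def timed_parquet_read(spark, data_root):
    """Call spark.read.parquet(data_root) and return (DataFrame, seconds elapsed).

    Assumption: `spark` is an already-created SparkSession. In my case it is
    this call itself that triggers the series of small jobs described above.
    """
    start = time.time()
    df = spark.read.parquet(data_root)
    elapsed = time.time() - start
    return df, elapsed


def jobs_for(total_files, tasks_per_job):
    """Number of jobs needed if each job handles at most `tasks_per_job` files."""
    return -(-total_files // tasks_per_job)  # ceiling division


# Figures taken from the Spark UI: ~96,000 files, ~2000 tasks per job.
print(jobs_for(96000, 2000))  # 48
```

With a real session this would be called as `df, elapsed = timed_parquet_read(spark, data_root)`. The arithmetic matches what I see: 96,000 files at about 2000 tasks per job gives the 48 jobs.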
What I find strange is that this doesn't happen in a single job with 96,000 tasks, which I would expect to be faster (no stage boundaries). Where does the number 2000 come from? Is there a parameter I can tune to force more of these tiny tasks into the same job, thereby speeding things up?
to merge all of them together first and then use Spark to read it. – philantrovert