0
votes

Configuration:

Spark 3.0.1

Cluster Databricks( Driver c5x.2xlarge, Worker (2) same as driver )

Source : S3

Format : Parquet

Size : 50 mb

File count : 2000 ( too many small files, as they are dumped from a Kinesis stream in 1-minute batches because we cannot tolerate higher latency )

Problem Statement : I have 10 jobs with similar configurations, each processing a similar volume of data as above. When I run them individually, they take 5-6 mins each, including cluster spin-up time.

But when I run them together, they all seem to get stuck at the same point in the code and take 40-50 mins to complete.

When I check the Spark UI, I see that all the jobs spend 90% of their time taking the source count:

    df = spark.read.parquet('s3a://....')
    df.cache()
    df.count()   # ----- problematic step
    # ....more code logic

Now, I know that taking the count before caching should be faster for Parquet files, but the jobs were taking even more time when I didn't cache the dataframe before counting, probably because of the huge number of small files.
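A quick way to sanity-check that the small files are what dominate is to look at how many read tasks Spark creates for the scan. This is a minimal sketch with a placeholder path, not the real job code:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # placeholder path; substitute the real Kinesis dump location
    df = spark.read.parquet('s3a://bucket/prefix/')

    # shows how Spark packed the ~2000 small files into read tasks; each task
    # still has to list/open every file assigned to it, which is where the
    # time goes against S3
    print(df.rdd.getNumPartitions())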

But what I fail to understand is why the jobs run so much faster when run one at a time.

Is S3 my bottleneck? They are all reading from the same bucket but different paths.

** I'm using Privacera tokens for authentication.


1 Answer

1
votes
  1. They'll all be using the same s3a filesystem class instances on the worker nodes, and there are options there for the number of HTTP connections to keep open: fs.s3a.connection.maximum, default 48. If all the work is against the same bucket, set it to at least 2x the number of worker threads. Do the same for "fs.s3a.max.total.tasks".

  2. If you are using Hadoop 2.8+ binaries, switch the s3a client into random IO mode, which delivers the best performance when seeking around Parquet files: fs.s3a.experimental.fadvise = random.

Change #2 should deliver a speedup on single workloads as well, so do it anyway; a sketch of setting both options through the Spark configuration is shown below.
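This is a minimal sketch only, assuming the jobs build their own SparkSession; on Databricks the same spark.hadoop.* keys could equally go into the cluster's Spark config. The values are illustrative starting points, not tuned recommendations:

    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        # 1. allow more HTTP connections to the bucket than concurrent worker threads (default 48)
        .config("spark.hadoop.fs.s3a.connection.maximum", "200")
        # ...and a matching limit on queued s3a tasks
        .config("spark.hadoop.fs.s3a.max.total.tasks", "200")
        # 2. random IO mode for seek-heavy columnar formats like Parquet (Hadoop 2.8+)
        .config("spark.hadoop.fs.s3a.experimental.fadvise", "random")
        .getOrCreate()
    )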

Throttling would surface as 503 responses, which are handled inside the AWS SDK and don't get collected/reported. I'd recommend that, at least while debugging this, you turn on S3 bucket logging and scan the logs for 503 responses, which indicate throttling is taking place. It's what I do. Tip: set up a rule to delete old logs to keep costs down; 1-2 weeks of logs is generally enough for me.
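As a rough illustration of what that scan could look like, assuming the server access logs have already been downloaded to a local directory (the directory name and matching logic are placeholders, not part of the original answer):

    import glob

    throttled = 0
    for path in glob.glob("s3-access-logs/*"):  # placeholder local directory
        with open(path, errors="replace") as f:
            for line in f:
                # In the S3 server access log format the HTTP status code follows
                # the quoted Request-URI, so this substring check catches 503s.
                if '" 503 ' in line:
                    throttled += 1

    print(f"503 (throttling) responses found: {throttled}")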

Finally, lots of small files are bad on HDFS and awful with object stores, as the time to list/open each one is so high. Try to make coalescing the files step #1 in processing the data; a minimal compaction sketch follows.
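This sketch uses placeholder paths and an illustrative target file count; the right number of output files depends on the real data volume:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # read the many small Parquet files dumped by the Kinesis batches (placeholder path)
    small_files_df = spark.read.parquet("s3a://bucket/raw/")

    # rewrite them as a few larger files before any further processing;
    # 4 is illustrative - pick it so each output file is comfortably large
    small_files_df.coalesce(4).write.mode("overwrite").parquet("s3a://bucket/compacted/")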