Configuration:
Spark 3.0.1
Cluster : Databricks (Driver: c5x.2xlarge, Workers: 2, same as driver)
Source : S3
Format : Parquet
Size : 50 MB
File count : 2000 (too many small files, since they are dumped from a Kinesis stream in 1-minute batches because we cannot afford more latency; a quick way to confirm the fan-out is sketched below)
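For context, a minimal sketch (not my actual job code; the bucket/prefix is a placeholder) of one way to confirm the small-file fan-out on Databricks, where dbutils is available in notebooks:

    # Minimal sketch, not the real job. The prefix is a placeholder and this
    # assumes the Parquet files sit directly under it (no nested date folders).
    src = "s3a://my-bucket/kinesis-landing/"

    files = dbutils.fs.ls(src)                      # Databricks filesystem utility
    total_mb = sum(f.size for f in files) / 1e6
    print(len(files), "files,", round(total_mb, 1), "MB total")

    df = spark.read.parquet(src)
    print(df.rdd.getNumPartitions(), "partitions after the read")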
Problem Statement : I have 10 jobs with similar configurations, each processing a similar volume of data as above. When I run them individually, each takes 5-6 minutes including cluster spin-up time.
But when I run them together, they all seem to get stuck at the same point in the code and take 40-50 minutes to complete.
When I check the Spark UI, I see that all the jobs spend 90% of their time taking the source count:
    df = spark.read.parquet('s3a://....')
    df.cache()
    df.count()   # ----- problematic step
    # ....more code logic
Now, I know that taking the count without caching first should be faster for Parquet files (since no column data needs to be scanned), but the jobs took even longer when I did not cache the DataFrame before counting, probably because of the huge number of small files. Roughly what I compared against is sketched below.
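For reference, the uncached variant looked roughly like this (path elided as above, so treat it as a sketch rather than my exact code):

    # Uncached variant: count straight off the Parquet source.
    df = spark.read.parquet('s3a://....')   # same elided path as above

    # No cache() here; with ~2000 small files this was even slower for me,
    # presumably because listing and opening each file dominates.
    row_count = df.count()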
But what I fail to understand is why the jobs run so much faster when run one at a time.
Is S3 my bottleneck? They are all reading from the same bucket but different paths.
** I'm using Privacera tokens for authentication.