The number of tasks in Spark is decided by the total number of RDD partitions at the beginning of the stage. For example, when a Spark application reads data from HDFS, the partitioning of the Hadoop RDD is inherited from FileInputFormat in MapReduce, which is affected by the size of the HDFS blocks, the value of mapred.min.split.size, the compression method, etc.
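For illustration, here is a minimal sketch in Scala (the HDFS path and the split-size values are hypothetical) of two common ways to influence the initial partition count when reading a file: passing a minPartitions hint to textFile, or changing the Hadoop split-size settings that FileInputFormat uses.

import org.apache.spark.{SparkConf, SparkContext}

object PartitionCountSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("partition-count-sketch"))

    // Hypothetical HDFS path; the actual partition count also depends on the
    // file's HDFS blocks, the split size and the compression codec.
    val path = "hdfs:///data/input"

    // Option 1: ask for at least N input partitions when reading the file.
    val rdd1 = sc.textFile(path, minPartitions = 12)
    println(s"textFile partitions: ${rdd1.getNumPartitions}")

    // Option 2: lower the split sizes used by FileInputFormat, so Hadoop
    // produces more, smaller splits (and hence more tasks).
    sc.hadoopConfiguration.setLong("mapred.min.split.size", 16L * 1024 * 1024)
    sc.hadoopConfiguration.setLong("mapreduce.input.fileinputformat.split.maxsize", 32L * 1024 * 1024)
    val rdd2 = sc.textFile(path)
    println(s"textFile partitions after split-size change: ${rdd2.getNumPartitions}")

    sc.stop()
  }
}

Note that splittability depends on the compression codec: a single gzip file, for example, cannot be split regardless of these settings.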

The tasks in the screenshot took 7, 7, and 4 seconds, and I want to make them balanced. Also, the stage is split into only 3 tasks; is there any way to tell Spark how many partitions/tasks to use?
Try a .repartition(200) operation first: spark.apache.org/docs/latest/… Nevertheless, the input size is really small, so the number of HDFS blocks will also be low. For optimal HDFS performance, the files should be approximately as large as the block size. You could repartition in Spark to distribute the data among more executors. - Fokko Driesprong
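A minimal sketch of that suggestion, assuming a small input file on a hypothetical path (200 is simply the value used in the comment and should be tuned to the cluster):

import org.apache.spark.{SparkConf, SparkContext}

object RepartitionSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("repartition-sketch"))

    // Hypothetical small input that HDFS stores in only a few blocks,
    // so the read produces only a few partitions/tasks.
    val rdd = sc.textFile("hdfs:///data/small-input")

    // repartition(200) shuffles the data into 200 partitions, so the next
    // stage runs 200 tasks that can be spread across more executors.
    val spread = rdd.repartition(200)
    println(s"partitions after repartition: ${spread.getNumPartitions}")

    sc.stop()
  }
}

Keep in mind that repartition triggers a full shuffle, so it only pays off when the downstream work per partition is large enough to outweigh the shuffle cost.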