1
votes

I have a small Spark standalone cluster with dynamic resource allocation that uses AWS S3 as storage. I start Spark SQL and create a Hive external table loading data from a 779.3 KB CSV file in an S3 bucket. When I execute the SQL "select count(1) from sales;", the Spark SQL job has exactly 798,009 tasks, essentially one task per byte. Setting "spark.default.parallelism" doesn't help. Is there any advice?


3 Answers

3
votes

If you are using Hadoop 2.6 JARs, then it's a bug in that version of s3a; if you are seeing it elsewhere, it may be a config problem.

Your file is being split into one partition per byte because the filesystem is saying "each partition is one byte long", which means that FileSystem.getBlockSize() is returning 0 (cf. HADOOP-11584: s3a file block size set to 0 in getFileStatus).

For the s3a connector, make sure you are using Hadoop 2.7+ and set fs.s3a.block.size to something like 33554432 (i.e. 32 MB), at which point your source file won't get split up at all.
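
A minimal sketch of how you could pass that setting through Spark; the app name, bucket and file path are placeholders, not taken from the original post:

    import org.apache.spark.sql.SparkSession

    // Route the s3a block size through Spark's Hadoop configuration.
    // 33554432 bytes = 32 MB, as suggested above.
    val spark = SparkSession.builder()
      .appName("s3a-block-size-example")
      .config("spark.hadoop.fs.s3a.block.size", "33554432")
      .getOrCreate()

    // Hypothetical path: with a non-zero block size, the ~780 KB CSV
    // should come back as a single partition instead of one per byte.
    val sales = spark.read.option("header", "true").csv("s3a://your-bucket/sales.csv")
    println(sales.rdd.getNumPartitions)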

If you can, go up to 2.8: we've done a lot of work speeding up both input and output, especially for columnar-format IO and its seek patterns.

-1
votes

Try DF.repartition(1) before running the query. There must be too many partitions when you run this command.
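
A minimal sketch, assuming the table was already loaded into a DataFrame named df (a hypothetical name, not from the post):

    // Collapse the over-partitioned input into a single partition,
    // then count; df is assumed to already hold the sales data.
    val counted = df.repartition(1).count()
    println(counted)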

-1
votes

Use spark.sql.shuffle.partitions=2.
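
For example, you could set it either on the session or directly in SQL before running the count (a sketch; the value 2 is just the suggestion above, and spark is an existing SparkSession):

    // Set it programmatically on the session...
    spark.conf.set("spark.sql.shuffle.partitions", "2")

    // ...or via a SET statement in Spark SQL before the query:
    spark.sql("SET spark.sql.shuffle.partitions=2")
    spark.sql("select count(1) from sales").show()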