I have a Hive table on AWS S3 consisting of 144 CSV-formatted files (~20 MB each), about 3 GB in total.
When I run a query against it through Spark SQL, 10-15 GB are downloaded from S3 (the amount varies between runs, measured via the AWS service metrics), far more than the table size. When I run the same query through the Hive client, the bytes downloaded equal the table size on S3.
The SQL is as simple as 'select count(1) from #table#'.
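For reference, this is roughly how I execute it (a minimal sketch; the session setup and the table name `my_table` are stand-ins for my actual job):

```scala
import org.apache.spark.sql.SparkSession

// Minimal reproduction of my setup; "my_table" stands in for the real table name
val spark = SparkSession.builder()
  .appName("s3-count-repro")
  .enableHiveSupport() // read the table through the Hive metastore
  .getOrCreate()

// The query that triggers the 10-15 GB of S3 traffic
spark.sql("select count(1) from my_table").show()
```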
On the Spark UI Stages tab there are 2k+ tasks, far more than I would expect for reading 144 files, so is a single file being accessed by multiple tasks?
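In case it helps, this is how I inspected the partitioning (a sketch; I assume the split-size settings below are what control the task count, but I may be wrong):

```scala
// Check how many partitions (and hence scan tasks) Spark plans for the table
val df = spark.table("my_table")
println(s"input partitions: ${df.rdd.getNumPartitions}")

// Settings that, as far as I understand, influence the split size of the scan
println(spark.conf.get("spark.sql.files.maxPartitionBytes")) // 128 MB default, I believe
println(spark.sparkContext.hadoopConfiguration
  .get("mapreduce.input.fileinputformat.split.maxsize"))     // null if unset
```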
Any help will be appreciated!
