I have a Hive table on AWS S3 consisting of 144 CSV-formatted files (~20 MB each), about 3 GB in total.
When I run a query against it through Spark SQL, 10-15 GB are downloaded from S3 (the amount varies between runs, measured via the AWS service metrics), far more than the table size. When I run the same query through the Hive client, the bytes downloaded equal the table size on S3.
The SQL is as simple as 'select count(1) from #table#'.
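For reference, this is roughly how I execute it (a minimal sketch; the session setup and the table name `my_table` are stand-ins for my actual job):

```scala
import org.apache.spark.sql.SparkSession

// Minimal reproduction of my setup; "my_table" stands in for the real table name
val spark = SparkSession.builder()
  .appName("s3-count-repro")
  .enableHiveSupport() // read the table through the Hive metastore
  .getOrCreate()

// The query that triggers the 10-15 GB of S3 traffic
spark.sql("select count(1) from my_table").show()
```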
On the Spark UI Stages tab there are 2k+ tasks, far more than I would expect for reading 144 files, so is a single file being accessed by multiple tasks?
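In case it helps, this is how I inspected the partitioning (a sketch; I assume the split-size settings below are what control the task count, but I may be wrong):

```scala
// Check how many partitions (and hence scan tasks) Spark plans for the table
val df = spark.table("my_table")
println(s"input partitions: ${df.rdd.getNumPartitions}")

// Settings that, as far as I understand, influence the split size of the scan
println(spark.conf.get("spark.sql.files.maxPartitionBytes")) // 128 MB default, I believe
println(spark.sparkContext.hadoopConfiguration
  .get("mapreduce.input.fileinputformat.split.maxsize"))     // null if unset
```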
Any help will be appreciated!
