I am using Spark SQL to read two different datasets stored in ORC format in S3. The datasets are almost the same size, but the read performance differs hugely.
Dataset 1: 212,000,000 records with 50 columns each, totaling ~15 GB in ORC format in an S3 bucket.
Dataset 2: 29,000,000 records with 150 columns each, totaling ~15 GB in ORC format in the same S3 bucket.
Dataset 1 takes 2 minutes to read with Spark SQL, while the same read/count job on the same infrastructure takes 12 minutes for Dataset 2.
Could the length of each row cause this big a difference? Can anyone help me understand the reason behind the huge performance difference when reading these datasets?