I'm trying to cache a Hive Table in memory using CACHE TABLE tablename;
After this command, the table gets successfully cached however i noticed a skew in the way the RDD in partitioned in memory.
Here's what i see in the "Storage" tab on the application master
rdd_71_1 Memory Deserialized 1x Replicated 1264.7 MB 0.0 B node4:38759
rdd_71_10 Memory Deserialized 1x Replicated 11.6 MB 0.0 B node1:58115
rdd_71_11 Memory Deserialized 1x Replicated 25.7 MB 0.0 B node1:53968
rdd_71_2 Memory Deserialized 1x Replicated 72.6 MB 0.0 B node4:54133
rdd_71_4 Memory Deserialized 1x Replicated 1260.9 MB 0.0 B node2:33179
rdd_71_5 Memory Deserialized 1x Replicated 56.8 MB 0.0 B node2:54222
rdd_71_7 Memory Deserialized 1x Replicated 54.5 MB 0.0 B node4:34149
rdd_71_8 Memory Deserialized 1x Replicated 1277.8 MB 0.0 B node1:43572
rdd_71_9 Memory Deserialized 1x Replicated 1255.8 MB 0.0 B node1:58518
Notice some partitions are of the range of 11MB to 72MB and other partitions are of the range ~1200MB
Even when i'm not caching the table, but just simply processing from disk, i see that some tasks complete MUCH earlier than others which further confirms my guess about skewness.
Whats going on here? How can i avoid this data skew?
PS : The table is stored in the ORC format