Apache Spark SQL - RDD In-Memory Data Skew

Question

I'm trying to cache a Hive Table in memory using CACHE TABLE tablename;

After this command, the table gets successfully cached however i noticed a skew in the way the RDD in partitioned in memory.

Here's what i see in the "Storage" tab on the application master

rdd_71_1    Memory Deserialized 1x Replicated   1264.7 MB   0.0 B   node4:38759
rdd_71_10   Memory Deserialized 1x Replicated   11.6 MB     0.0 B   node1:58115
rdd_71_11   Memory Deserialized 1x Replicated   25.7 MB     0.0 B   node1:53968
rdd_71_2    Memory Deserialized 1x Replicated   72.6 MB     0.0 B   node4:54133
rdd_71_4    Memory Deserialized 1x Replicated   1260.9 MB   0.0 B   node2:33179
rdd_71_5    Memory Deserialized 1x Replicated   56.8 MB     0.0 B   node2:54222
rdd_71_7    Memory Deserialized 1x Replicated   54.5 MB     0.0 B   node4:34149
rdd_71_8    Memory Deserialized 1x Replicated   1277.8 MB   0.0 B   node1:43572
rdd_71_9    Memory Deserialized 1x Replicated   1255.8 MB   0.0 B   node1:58518

Notice some partitions are of the range of 11MB to 72MB and other partitions are of the range ~1200MB

Even when i'm not caching the table, but just simply processing from disk, i see that some tasks complete MUCH earlier than others which further confirms my guess about skewness.

Whats going on here? How can i avoid this data skew?

PS : The table is stored in the ORC format

I guess cache table is performing temporary caches to increase computing performance — eliasah

Glennie Helles Sindholt Glennie Helles Sindholt · Accepted Answer · 2015-09-03T10:59:34

I don't exactly know why your data is skewed when read directly from disk. However, I find that it is frequently useful to repartition your data to balance the size of the partitions and avoid being held up by a single long lasting task. I recommend reading the last part of https://www.safaribooksonline.com/library/view/learning-spark/9781449359034/ch04.html (the "Data Partitioning (Advanced)" section) which offers some nice tips :)

Apache Spark SQL - RDD In-Memory Data Skew

1 Answers