Spark on Parquet vs Spark on Hive(Parquet format)

Question

Our use case is a narrow table(15 fields) but large processing against the whole dataset(billions of rows). I am wondering what combination provides better performance:

env: CDH5.8 / spark 2.0

Spark on Hive tables(as format of parquet)
Spark on row files(parquet)

There are known issues about Scala lambdas being slower that SparkSQL expressions (that use scalar types directly, no round-trip to Objects) but it's usually marginal. And ORC vectorized reader is scheduled for Spark 2.3 if I remember well, while Parquet already has vectorization support. Other than that... I'm an old SQL user who finds Scala portmanteau expressions ridiculous, like so many sausage strings, but that's my personal opinion (set-based semantics, baby!) — Samson Scharfrichter
SparkSQL on row files(parquet or ORC). what do you mean by row files? orc is columnar storage right — loneStar

Igor Berman Igor Berman · Accepted Answer · 2017-11-09T18:58:29

Without additional context of your specific product and usecase - I'd vote for SparkSql on Hive tables for two reasons:

sparksql is usually better than core spark since databricks wrote different optimizations in sparksql, which is higher abstaction and gives ability to optimize code(read about Project Tungsten). In some cases manually written spark core code will be better, but it demands from the programmer deep understanding of the internals. In addition sparksql sometimes is limited and doesn't permit you to control low-level mechanisms, but you can always fallback to work with core rdd.
hive and not files - I'm assuming hive with external metastore. Metastore saves definitions of partitions of your "tables"(in files it could be some directory). This is one of the most important parts for the good performance. I.e. when working with files spark will need to load this info(which could be time consuming - e.g. s3 list operation is very slow). So metastore permits spark to fetch this info in simple and fast way

Spark on Parquet vs Spark on Hive(Parquet format)

2 Answers