2
votes

Our use case is a narrow table (15 fields) with heavy processing against the whole dataset (billions of rows). I am wondering which combination provides better performance (rough sketches of both options below):

env: CDH5.8 / spark 2.0

  1. Spark on Hive tables (stored as parquet)
  2. Spark on row files (parquet)
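For clarity, the two options correspond roughly to the following Spark 2.0 code; the table name, file path and column name are placeholders I made up, not part of the question:

```scala
import org.apache.spark.sql.SparkSession

object ReadPaths {
  def main(args: Array[String]): Unit = {
    // Hive support is needed for the metastore-backed option.
    val spark = SparkSession.builder()
      .appName("hive-vs-raw-parquet")
      .enableHiveSupport()
      .getOrCreate()

    // Option 1: Spark on a Hive table (schema and partitions come from the metastore).
    // "mydb.events" is a placeholder table name.
    val fromHive = spark.table("mydb.events")

    // Option 2: Spark directly on the parquet files (schema read from the file footers).
    // The HDFS path is a placeholder.
    val fromFiles = spark.read.parquet("hdfs:///data/events/parquet")

    // Same downstream processing either way ("some_key" is a placeholder column).
    fromHive.groupBy("some_key").count().show()
    fromFiles.groupBy("some_key").count().show()
  }
}
```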
2
There are known issues about Scala lambdas being slower than SparkSQL expressions (which use scalar types directly, with no round-trip to objects), but it's usually marginal. And the ORC vectorized reader is scheduled for Spark 2.3 if I remember correctly, while Parquet already has vectorization support. Other than that... I'm an old SQL user who finds Scala portmanteau expressions ridiculous, like so many sausage strings, but that's my personal opinion (set-based semantics, baby!) - Samson Scharfrichter
"SparkSQL on row files (parquet or ORC)" - what do you mean by row files? ORC is columnar storage, right? - loneStar

2 Answers

3
votes

Without additional context about your specific product and use case, I'd vote for SparkSQL on Hive tables, for two reasons:

  1. SparkSQL is usually better than core Spark, since Databricks put a lot of optimizations into SparkSQL: it is a higher abstraction, which gives the engine the ability to optimize your code (read about Project Tungsten). In some cases manually written Spark core code will be better, but that demands a deep understanding of the internals from the programmer. In addition, SparkSQL is sometimes limited and doesn't let you control low-level mechanisms, but you can always fall back to working with the core RDD API (see the sketch after this list).

  2. Hive and not files - I'm assuming Hive with an external metastore. The metastore stores the partition definitions of your "tables" (with plain files, that would just be some directory layout). This is one of the most important factors for good performance: when working with files, Spark has to discover this information itself (which can be time consuming - e.g. an S3 list operation is very slow), whereas the metastore lets Spark fetch it in a simple and fast way.
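As a rough illustration of both points (the table name, columns and the partition column `dt` are made-up placeholders, not from the question), a Spark 2.x sketch might look like this:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object SqlVsRdd {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("sparksql-vs-core")
      .enableHiveSupport()
      .getOrCreate()

    // Point 1: express logic as SparkSQL/DataFrame expressions so Catalyst/Tungsten
    // can optimize and code-generate it ("mydb.events" and the columns are placeholders).
    val df = spark.table("mydb.events")
    val optimized = df
      .filter(col("amount") > 100)
      .groupBy(col("country"))
      .agg(sum(col("amount")).alias("total"))

    // Fall back to the RDD API only when you really need low-level control;
    // the lambdas below are opaque to the optimizer.
    val manual = df.rdd
      .filter(row => row.getAs[Double]("amount") > 100)
      .map(row => (row.getAs[String]("country"), row.getAs[Double]("amount")))
      .reduceByKey(_ + _)

    // Point 2: with a Hive metastore, a filter on a partition column ("dt" here is
    // a placeholder) is resolved from partition metadata rather than by listing files.
    val oneDay = spark.table("mydb.events").where(col("dt") === "2017-01-01")

    optimized.show()
    oneDay.show()
    manual.take(10).foreach(println)
  }
}
```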

3
votes

There are only two options here: Spark on files, or Spark on Hive tables. SparkSQL works on both, and you should prefer to use the Dataset API, not RDDs.

If you can define the Dataset schema yourself, reading the raw HDFS files with Spark will be faster because you bypass the extra hop to the Hive Metastore.
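A minimal sketch of what that looks like, assuming a hypothetical parquet path and a trimmed-down schema (the real table has 15 fields):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types._

object ExplicitSchemaRead {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("explicit-schema-read")
      .getOrCreate()

    // Hypothetical schema, trimmed to a few fields for the sketch.
    val schema = StructType(Seq(
      StructField("id", LongType, nullable = false),
      StructField("country", StringType),
      StructField("amount", DoubleType)
    ))

    // Supplying the schema up front skips both the metastore lookup and
    // schema inference from the files; the path is a placeholder.
    val events = spark.read
      .schema(schema)
      .parquet("hdfs:///data/events/parquet")

    events.groupBy("country").count().show()
  }
}
```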

When I did a simple test myself years ago (with Spark 1.3), I noticed that extracting 100,000 rows from a CSV file was orders of magnitude faster than a SparkSQL Hive query with the same LIMIT.
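If it helps, here's roughly how such a comparison could be set up with the Spark 2.x API; the path, table name and exact shape of the original test are my assumptions, not a reproduction of it:

```scala
import org.apache.spark.sql.SparkSession

object LimitComparison {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("limit-comparison")
      .enableHiveSupport()
      .getOrCreate()

    // Pulling the first N rows straight from CSV files (placeholder path).
    val fromCsv = spark.read
      .option("header", "true")
      .csv("hdfs:///data/events/csv")
      .limit(100000)

    // The same LIMIT expressed as a SparkSQL query against a Hive table
    // (placeholder table name).
    val fromHive = spark.sql("SELECT * FROM mydb.events LIMIT 100000")

    // Force both to execute so the timings can be compared.
    println(fromCsv.count())
    println(fromHive.count())
  }
}
```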