I am conducting a performance test which compares queries on existing internal Hive tables between Spark SQL and Hive on Tez. Throughout the tests, Spark was showing query execution time that was on par or faster than Hive on Tez. These results are consistent with many of the examples out there. However, there was one noted exception with a query that involved key based selection at the individual record level. In this instance, Spark was significantly slower than Hive on Tez.
After researching this topic on the internet, I could not find a satisfactory answer and wanted to pose this example to the SO community to see if this is an individual one-off case associated with our environment or data, or a larger pattern related to Spark.
Spark 1.6.1 Spark Conf: Executors 2, Executory Memory 32G, Executor Cores 4.
Data is in an internal Hive Table which is stored as ORC file types compressed with zlib. The total size of the compressed files is ~2.2 GB.
Here is the query code.
#Python API
#orc with zlib key based select
dforczslt = sqlContext.sql("SELECT * FROM dev.perf_test_orc_zlib WHERE test_id= 12345678987654321")
dforczslt.show()
The total time to complete this query was over 400 seconds, compared to ~6 seconds with Hive on Tez. I also tried using predicate pushdown via the SQL context configs but this resulted in no noticeable performance increase. Also, when this same test was conducted using Parquet the query time was on par with Hive as well. I'm sure there are other solutions out there to increase the performance of the queries such as using RDDS v. Dataframes etc. But I'm really looking to understand how Spark is interacting with ORC files which is resulting in this gap.
Let me know if I can provide additional clarification around any of the talking points listed above.