2
votes

I am conducting a performance test which compares queries on existing internal Hive tables between Spark SQL and Hive on Tez. Throughout the tests, Spark was showing query execution time that was on par or faster than Hive on Tez. These results are consistent with many of the examples out there. However, there was one noted exception with a query that involved key based selection at the individual record level. In this instance, Spark was significantly slower than Hive on Tez.

After researching this topic on the internet, I could not find a satisfactory answer and wanted to pose this example to the SO community to see if this is an individual one-off case associated with our environment or data, or a larger pattern related to Spark.

Spark 1.6.1 Spark Conf: Executors 2, Executory Memory 32G, Executor Cores 4.

Data is in an internal Hive Table which is stored as ORC file types compressed with zlib. The total size of the compressed files is ~2.2 GB.

Here is the query code.

#Python API    
#orc with zlib key based select
dforczslt = sqlContext.sql("SELECT * FROM dev.perf_test_orc_zlib WHERE test_id= 12345678987654321")
dforczslt.show()

The total time to complete this query was over 400 seconds, compared to ~6 seconds with Hive on Tez. I also tried using predicate pushdown via the SQL context configs but this resulted in no noticeable performance increase. Also, when this same test was conducted using Parquet the query time was on par with Hive as well. I'm sure there are other solutions out there to increase the performance of the queries such as using RDDS v. Dataframes etc. But I'm really looking to understand how Spark is interacting with ORC files which is resulting in this gap.

Let me know if I can provide additional clarification around any of the talking points listed above.

1

1 Answers

2
votes

The following steps might help to improve the performance of the Spark SQL query.

In general, Hive take the memory of the whole Hadoop cluster which is significantly larger than the executer memory (Here 2* 32 = 64 GB). What's the memory size of the nodes ?.

Further, the number of executers seems to be less (2) when compare to the number of number of map/reduce jobs generated by the hive query. Increasing the number of executers in multiples of 2 might help to improve the performance.

In SparkSQL and Dataframe, optimised execution using manually managed memory (Tungsten) is now enabled by default, along with code generation for expression evaluation. this features can be enabled by setting spark.sql.tungsten.enabled to true in case if it's not already enabled.

sqlContext.setConf("spark.sql.tungsten.enabled", "true")

The columnar nature of the ORC format helps to avoid reading unnecessary columns. However, But, we are still reading unnecessary rows even if the query has WHERE clause filter.ORC predicate push-down would improve the performance with it's built-in indexs. Here, the ORC predicate push-down is disabled in the Spark SQL by default and need to be explicitly enabled.

sqlContext.setConf("spark.sql.orc.filterPushdown", "true")

I would recommend you to do some more research and find the potential performance blockers if any.