How to configure Hive to use Spark execution engine on Google Dataproc?

Question

I'm trying to configure Hive, running on Google Dataproc image v1.1 (so Hive 2.1.0 and Spark 2.0.2), to use Spark as an execution engine instead of the default MapReduce one.

Following the instructions at https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark%3A+Getting+Started doesn't really help: whenever I set hive.execution.engine=spark, I keep getting "Error running query: java.lang.NoClassDefFoundError: scala/collection/Iterable".

Does anyone know the specific steps to get this running on Dataproc? From what I can tell it should just be a question of making Hive see the right JARs, since both Hive and Spark are already installed and configured on the cluster, and using Hive from Spark (so the other way around) works fine.

Patrick Clay · Accepted Answer · 2017-04-11T00:27:44

This will probably not work with the jars that ship with a Dataproc cluster. In Dataproc, Spark is compiled with Hive bundled (-Phive), which Hive on Spark neither recommends nor supports: the wiki asks for a Spark build that does not include the Hive jars.

If you really want to run Hive on Spark, you could try using an initialization action to install your own Spark build, compiled as described in the wiki.

If you just want to get Hive off MapReduce on Dataproc, running Tez with this initialization action would probably be easier.
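
As a rough sketch of the Tez route, something like the following might work. The cluster name, region, and initialization action path are hypothetical placeholders, not values from this thread, so substitute your own. The small run helper is only there for illustration: it prints each command unless DATAPROC_RUN=1 is set, so nothing is created by accident.

```shell
# Hypothetical placeholders: replace all three with your own values.
CLUSTER="my-cluster"
REGION="us-central1"
TEZ_INIT_ACTION="gs://YOUR_BUCKET/path/to/tez-init-action.sh"

# Dry-run helper: prints the command unless DATAPROC_RUN=1 is set.
run() {
  if [ "${DATAPROC_RUN:-0}" = "1" ]; then
    "$@"
  else
    printf '%s\n' "$*"
  fi
}

# Create a cluster that runs the Tez initialization action on every node.
run gcloud dataproc clusters create "$CLUSTER" \
  --region "$REGION" \
  --initialization-actions "$TEZ_INIT_ACTION"

# Submit a Hive job that explicitly selects Tez as the execution engine.
run gcloud dataproc jobs submit hive \
  --cluster "$CLUSTER" \
  --region "$REGION" \
  -e "SET hive.execution.engine=tez; SHOW TABLES;"
```

If the initialization action already makes Tez the default engine in the Hive configuration, the explicit SET is redundant but should be harmless.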
