I created a Dataproc cluster and manually installed conda and a Jupyter notebook. Then I installed PySpark with conda. I can successfully run Spark with
from pyspark import SparkContext
sc = SparkContext(appName="EstimatePi")
However, I cannot enable Hive support. The following code gets stuck and never returns:
from pyspark.sql import SparkSession
spark = (SparkSession.builder
         .config('spark.driver.memory', '2G')
         .config("spark.kryoserializer.buffer.max", "2000m")
         .enableHiveSupport()
         .getOrCreate())
Python version 2.7.13, Spark version 2.3.4
Is there any way to enable Hive support?