
With pyspark on GCP, I am sometimes getting messages like

AnalysisException: "Database 'default' not found;"

From the research I've done, I understand this relates to Hive tables. Maybe I am supposed to explicitly tell Spark where the hive-site.xml file is (below, after the file list, I sketch what I imagine that would look like). I see I have this file

./etc/hive/conf.dist/hive-site.xml

and some other files that might be important are

./usr/local/share/google/dataproc/bdutil/conf/hive-template.xml
./usr/local/share/google/dataproc/bdutil/conf/hive-ha-mixins.xml
./etc/hive-hcatalog/conf.dist/proto-hive-site.xml
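
To be concrete, here is the kind of thing I imagine I might have to do if Spark really needs to be told about Hive explicitly. I have not tried either of these; the paths and config values are guesses on my part:

# Guess 1: copy the Hive config into Spark's conf dir
# (assuming /etc/spark/conf is where Spark looks; not verified).
sudo cp /etc/hive/conf.dist/hive-site.xml /etc/spark/conf/

# Guess 2: since I never use Hive tables, launch with the in-memory catalog
# instead of the Hive one. spark.sql.catalogImplementation is a real Spark
# setting, but I haven't confirmed it's the right fix on Dataproc.
pyspark --master yarn --deploy-mode client \
  --conf spark.sql.catalogImplementation=in-memory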

I'm getting into pyspark the same way I did on AWS. I'm ssh-ing to the cluster, and constructing my own spark-submit command. It starts out like

 export PYTHONPATH=/path/to/my/stuff:$PYTHONPATH
 export PYSPARK_PYTHON=/usr/bin/python3
 export PYSPARK_DRIVER_PYTHON=/usr/local/bin/ipython3
 pyspark --class main --master yarn --deploy-mode client  --conf spark.speculation=True

that is, I'm creating an interactive pyspark session with ipython directly on the master node of the dataproc cluster.

I don't pass any special options for enabling Hive or locating it. I'm not explicitly using Hive; I'm just reading parquet files, and all my SQL goes through the pyspark interface, things like

df = spark.read.parquet('gs://path/to/my/data')
df.groupBy('something').count().show()

Is this the wrong way to create an interactive pyspark session on a dataproc cluster? I have found documentation, like https://cloud.google.com/sdk/gcloud/reference/dataproc/jobs/submit/pyspark that goes over how to submit jobs from your laptop - but I haven't seen anything about starting an interactive session on the cluster. I'm worried that the gcloud dataproc jobs submit pyspark adds special options and configuration that I'm missing.

Guillem Xercavins: As you're running it with IPython, it can be interesting to look at the Datalab initialization action script and the env variables it defines.
MrCartoonology: Thanks, I didn't know those scripts were out there!
Guillem Xercavins: Did you get it working?
MrCartoonology: No, recent tasks haven't involved GCP that much, but I'm sure I'll hit it again when I'm working a lot there - or maybe the answer below will help.

1 Answer


+1 for using a notebook rather than spark-shell, as suggested in the comments. Many of the initialization actions have graduated to optional components, which are faster to install:

If you use --image-version=preview (which will eventually be --image-version=1.4), you will get Python 3 and conda/pip by default. That way you don't need to run any scripts to set up Python 3.

Installing Anaconda via --optional-components ANACONDA is also convenient because it comes with a lot of common data science packages.
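
For reference, putting those flags together at cluster creation time looks roughly like this (the cluster name and region are placeholders, and depending on your gcloud version the --optional-components flag may require the beta track, i.e. gcloud beta dataproc):

# Create a cluster that defaults to Python 3 / conda and adds the Anaconda component.
gcloud dataproc clusters create my-cluster \
    --image-version=preview \
    --optional-components=ANACONDA \
    --region=us-central1

Depending on the image version, there is also a JUPYTER optional component that can go in the same list if you want the notebook route from the comments.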

Just note that setting up an SSH tunnel to view the web interface is a little tricky. Here's the tl;dr of that doc:

# Terminal 1: Run an SSH tunnel on port 1080
gcloud compute ssh clustername-m -- -nND 1080

# Terminal 2: Run Chrome using the proxy on port 1080 (this is the Linux
# command; it is different on a Mac). Make sure you don't have any corporate
# proxy settings that might interfere with using the proxy on 1080.
/usr/bin/google-chrome --proxy-server="socks5://localhost:1080" --host-resolver-rules="MAP * 0.0.0.0 , EXCLUDE localhost" --user-data-dir=/tmp/master-host-name

Back to your original question: we configure Spark's classpath to include Hive configuration (look in /etc/spark/conf/spark-env.sh). So the pyspark, spark-shell, and spark-submit commands should already be set up correctly without needing any arguments. The code snippets you pasted don't really touch Hive (aka you're not reading or writing Hive tables), so I'm not sure why you're hitting that error message.
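
If you want to double-check that on your cluster, a quick look on the master node should confirm the Hive config is being picked up (paths here assume the standard Dataproc image layout):

# Confirm Spark's startup config references Hive, per the note above.
grep -i hive /etc/spark/conf/spark-env.sh

# And confirm the hive-site.xml you found is actually present.
ls -l /etc/hive/conf.dist/hive-site.xml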