With pyspark on GCP, I am sometimes getting messages like
AnalysisException: "Database 'default' not found;"
From the research I've done, I understand this relates to Hive tables. Maybe I'm supposed to explicitly tell Spark where the hive-site.xml file is. I see I have this file
./etc/hive/conf.dist/hive-site.xml
and some other files that might be important are
./usr/local/share/google/dataproc/bdutil/conf/hive-template.xml
./usr/local/share/google/dataproc/bdutil/conf/hive-ha-mixins.xml
./etc/hive-hcatalog/conf.dist/proto-hive-site.xml
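One check I was thinking of running inside the session, to see whether it even has Hive as its catalog, is something like the snippet below. This is just my guess at the relevant setting, based on reading that spark.sql.catalogImplementation controls whether Spark uses the Hive catalog; I haven't confirmed it's the right diagnostic:

# Run inside the interactive pyspark session. If this prints "in-memory"
# rather than "hive", the session was presumably built without Hive support,
# which I assume could explain the missing 'default' database.
print(spark.conf.get("spark.sql.catalogImplementation", "not set"))

# And this should list whatever databases the catalog does know about.
for db in spark.catalog.listDatabases():
    print(db.name)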
I'm getting into pyspark the same way I did on AWS: I ssh to the cluster and construct my own spark-submit command. It starts out like
export PYTHONPATH=/path/to/my/stuff:$PYTHONPATH
export PYSPARK_PYTHON=/usr/bin/python3
export PYSPARK_DRIVER_PYTHON=/usr/local/bin/ipython3
pyspark --class main --master yarn --deploy-mode client --conf spark.speculation=True
That is, I'm creating an interactive pyspark session with IPython directly on the master node of the Dataproc cluster.
I don't pass any special options to enable Hive or to tell Spark where its configuration lives. I'm not explicitly using Hive; I'm just reading parquet files, and all my SQL goes through the pyspark interface, things like
df = spark.read.parquet('gs://path/to/my/data')
df.groupBy('something').count().show()
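If the missing piece really is Hive support in the session, I assume I could build the session myself with Hive enabled, roughly like the sketch below. I haven't confirmed this is what the gcloud submission path sets up for you, and I realize getOrCreate() may just hand back the session the shell already created, so the catalog setting might not take effect there:

# Hedged sketch: explicitly build a SparkSession with Hive support.
# enableHiveSupport() sets spark.sql.catalogImplementation=hive, which should
# make Spark pick up hive-site.xml from its conf directory (my assumption
# about how this applies to Dataproc).
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("interactive-parquet-session")  # hypothetical app name
         .enableHiveSupport()
         .getOrCreate())

df = spark.read.parquet('gs://path/to/my/data')
df.groupBy('something').count().show()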
Is this the wrong way to create an interactive pyspark session on a Dataproc cluster? I have found documentation, like https://cloud.google.com/sdk/gcloud/reference/dataproc/jobs/submit/pyspark, that goes over how to submit jobs from your laptop, but I haven't seen anything about starting an interactive session on the cluster itself. I'm worried that gcloud dataproc jobs submit pyspark adds special options and configuration that I'm missing.