I once used Dataproc (image version 1.1) with Zeppelin 0.6.2 to create Hive tables stored in a Google Cloud Storage bucket. I then created another Dataproc cluster, image version 1.2, which uses Zeppelin 0.7.1, following https://zeppelin.apache.org/docs/0.7.1/interpreter/spark.html. Once every external component (Hive metastore on a MySQL server, Zeppelin) was fully initialized, I queried all Hive tables using

%sql 
show tables

but the tables created in the previous Dataproc version were not returned. I rechecked the initialization scripts zeppelin.sh and cloud-sql-proxy.sh and they were correct. I then rechecked the value of hive.metastore.warehouse.dir and it matched the one used in the previous Dataproc version; note that in Spark 2.2.0 this setting is superseded by spark.sql.warehouse.dir (see https://issues.apache.org/jira/browse/SPARK-15034).
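
To rule out a warehouse-path mismatch, a quick diagnostic sketch like the following can be run in a Zeppelin %spark paragraph or in spark-shell (assuming the session is exposed as spark, as it is by default in both):

// Compare the Spark 2.x warehouse setting with the legacy Hive one.
println(spark.conf.get("spark.sql.warehouse.dir"))
println(spark.sparkContext.hadoopConfiguration.get("hive.metastore.warehouse.dir"))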

I then created a new Hive table, table_zeppelin, and its content was stored in the bucket correctly. When I verified it with show tables, the table showed up as expected. But once I restarted Zeppelin and reran show tables, I got nothing back. Strange, because the content of table_zeppelin was already in the bucket. When I checked the TBLS table in the MySQL instance that stores the Hive metastore, table_zeppelin was not there. My guess was that something was wrong with the Hive metastore.
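
One way to inspect the metastore directly is a JDBC query against its TBLS table; a minimal sketch, assuming the mysql-connector jar is on the classpath, the cloud-sql-proxy listens on localhost:3306, and the credentials below are placeholders (the database name comes from the --metadata flag in the cluster creation script further down):

// Query the Hive metastore's TBLS table through the local cloud-sql-proxy.
import java.sql.DriverManager
val conn = DriverManager.getConnection(
  "jdbc:mysql://localhost:3306/hive_metastore_dev", "hive", "<password>")
val rs = conn.createStatement().executeQuery("SELECT TBL_NAME FROM TBLS")
while (rs.next()) println(rs.getString("TBL_NAME"))  // registered tables
conn.close()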

Surprisingly, when I created another Hive table, table_spark, this time via spark-shell, everything worked as expected. When I ran show tables I got table_spark and all the tables created in the previous Dataproc version, but not table_zeppelin, the one previously created via Zeppelin 0.7.1. table_spark was also present in the TBLS table of the MySQL instance. I'm quite certain something is wrong with how the Hive metastore is set up in Zeppelin 0.7.1, since Zeppelin can't read or write anything to the metastore. I can confirm that SPARK_HOME is set correctly in zeppelin-env.sh to point to the Dataproc Spark installation.
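
The spark-shell session that worked looked roughly like this (the exact DDL is illustrative; any Hive table definition behaves the same):

// Create a Hive table from spark-shell and list the catalog.
spark.sql("CREATE TABLE table_spark (id INT) STORED AS PARQUET")
spark.sql("SHOW TABLES").show()  // shows table_spark and the old tables, but not table_zeppelin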

Here's my cluster creation script:

gcloud dataproc --region us-west1 clusters create coco-cluster \
  --bucket rcom_dataproc_dev \
  --zone us-west1-a \
  --master-machine-type n1-highmem-4 \
  --master-boot-disk-size 500 \
  --num-workers 3 \
  --worker-machine-type n1-highcpu-8 \
  --worker-boot-disk-size 500 \
  --image-version 1.2 \
  --project true-dmp \
  --initialization-actions 'gs://dmp_recommendation_dev/env_dependencies/cloud-sql-proxy.sh','gs://dmp_recommendation_dev/env_dependencies/zeppelin.sh' \
  --scopes cloud-platform \
  --properties hive:hive.metastore.warehouse.dir=gs://rcom_dataproc_dev/hive-warehouse \
  --metadata "hive-metastore-instance=true-dmp:asia-northeast1:rcom-metastore-sql,hive-metastore-db=hive_metastore_dev"

Note that the MySQL instance storing the Hive metastore is in Asia while the cluster is in the US. I don't think that's the cause of this.

So my question is: how can I configure Zeppelin 0.7.1 to recognize the Hive metastore stored in the Google Cloud SQL instance?

Thank you
Peeranat F.

1 Answer


Thanks for the detailed repro. This was fixed in the (as yet unreleased) Zeppelin 0.8: https://issues.apache.org/jira/browse/ZEPPELIN-2377.

We'll backport the fix into our packages and edit this post in the next couple of weeks when it rolls out.

In the meantime, spark-shell, spark-sql, and spark-submit on the command line, as well as Spark and Spark SQL jobs submitted through the Dataproc API, should still work.
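
As a quick way to tell whether a given Zeppelin build is affected, you can check which catalog the interpreter's session actually received (a sketch, assuming the session is exposed as spark in a %spark paragraph):

// "hive" means the session is wired to the Hive metastore;
// "in-memory" means it fell back to the transient catalog (the bug above).
println(spark.conf.get("spark.sql.catalogImplementation"))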