I once used Dataproc (image version 1.1) with Zeppelin 0.6.2 to create Hive tables stored in a Google Cloud Storage bucket. I then created another Dataproc cluster, image version 1.2, which uses Zeppelin 0.7.1, following https://zeppelin.apache.org/docs/0.7.1/interpreter/spark.html. Once every external component (Hive metastore on a MySQL server, Zeppelin) was fully initialized, I queried all Hive tables using
%sql
show tables
but the tables created in the previous Dataproc version were not returned. I rechecked the initialization scripts zeppelin.sh and cloud-sql-proxy.sh and they were correct. I then rechecked the value of hive.metastore.warehouse.dir and it matched the one used in the previous Dataproc version, except that Spark 2.2.0 now uses spark.sql.warehouse.dir instead (see https://issues.apache.org/jira/browse/SPARK-15034).
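To double-check which warehouse directory is actually in effect, one way is to ask Spark itself rather than reading the config files. The commands below are a sketch; they assume spark-shell is on the PATH of the master node, which is typical for Dataproc but not confirmed in the post:

```shell
# In a Zeppelin %sql paragraph (or spark-sql), the effective value can be
# shown with:
#   SET spark.sql.warehouse.dir;
#
# From spark-shell on the master node, print every warehouse-related setting:
spark-shell <<'EOF'
spark.conf.getAll.filter(_._1.contains("warehouse")).foreach(println)
EOF
```

If the value printed inside Zeppelin differs from the one printed in spark-shell, that would point to the two using different configurations.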
I then created a new Hive table, table_zeppelin, and its content was stored in the bucket correctly. When I verified it with show tables, the table showed up as expected. But once I restarted Zeppelin and reran show tables, I got nothing back. Strange, because the content of table_zeppelin was already in the bucket. When I checked the TBLS table in the MySQL instance that stores the Hive metastore, table_zeppelin was not there. I suspect something is wrong with the Hive metastore.
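For reference, this is roughly how the metastore contents can be inspected. The database name comes from the cluster metadata in the creation script below; the host, port, and user are assumptions (connecting through the Cloud SQL proxy on its default local port):

```shell
# List every table registered in the Hive metastore database
# (host/port/user are illustrative; adjust to your proxy setup):
mysql -h 127.0.0.1 -P 3306 -u hive -p \
  -e "SELECT TBL_NAME, TBL_TYPE FROM hive_metastore_dev.TBLS;"
```

A table that shows up in `show tables` but not in TBLS was registered somewhere other than this metastore, e.g. in a local Derby metastore or an in-memory catalog.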
Surprisingly, when I created another Hive table, table_spark, this time via spark-shell, everything worked as expected. When I ran show tables I got table_spark and all the tables created in the previous Dataproc version, but not table_zeppelin, which had been created via Zeppelin 0.7.1. table_spark also appeared in the TBLS table of the MySQL instance. I'm quite certain something is wrong with the Hive metastore setup in Zeppelin 0.7.1, as Zeppelin can't read from or write to the metastore. I can confirm that SPARK_HOME is set correctly in zeppelin-env.sh to point to the Dataproc Spark installation.
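Since spark-shell works and Zeppelin doesn't, one thing worth comparing is whether both see the same Hive client configuration. A quick sketch (the paths are typical Dataproc locations, not verified for this cluster):

```shell
# Does zeppelin-env.sh really point at the Dataproc Spark?
grep SPARK_HOME /usr/lib/zeppelin/conf/zeppelin-env.sh

# Does Spark's conf dir carry a hive-site.xml pointing at the Cloud SQL metastore?
ls -l /usr/lib/spark/conf/hive-site.xml

# Which JDBC URL does Hive itself use?
grep -A1 'javax.jdo.option.ConnectionURL' /etc/hive/conf/hive-site.xml
```

If Zeppelin's interpreter process cannot see hive-site.xml (or starts with Hive support disabled), Spark 2.x silently falls back to an in-memory catalog, which would match the symptoms described: tables visible until restart, nothing written to TBLS.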
Here's my cluster creation script:
gcloud dataproc --region us-west1 clusters create coco-cluster \
  --bucket rcom_dataproc_dev \
  --zone us-west1-a \
  --master-machine-type n1-highmem-4 \
  --master-boot-disk-size 500 \
  --num-workers 3 \
  --worker-machine-type n1-highcpu-8 \
  --worker-boot-disk-size 500 \
  --image-version 1.2 \
  --project true-dmp \
  --initialization-actions 'gs://dmp_recommendation_dev/env_dependencies/cloud-sql-proxy.sh','gs://dmp_recommendation_dev/env_dependencies/zeppelin.sh' \
  --scopes cloud-platform \
  --properties hive:hive.metastore.warehouse.dir=gs://rcom_dataproc_dev/hive-warehouse \
  --metadata "hive-metastore-instance=true-dmp:asia-northeast1:rcom-metastore-sql,hive-metastore-db=hive_metastore_dev"
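After creation, the applied cluster properties can be read back to confirm the hive prefix property actually landed. A sketch (the `--format` projection path is my assumption about the response shape):

```shell
# Show the software properties recorded on the running cluster:
gcloud dataproc clusters describe coco-cluster --region us-west1 \
  --format='value(config.softwareConfig.properties)'
```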
Note that the MySQL instance storing the Hive metastore is in Asia while the cluster is in the US. I don't think that's the cause of this.
So my question is: how can I configure Zeppelin 0.7.1 to recognize the Hive metastore hosted in a Google Cloud SQL instance?
Thank you
Peeranat F.