1
votes

I checked the [documentation][1] about usage of Azure Databricks external Hive Metastore (Azure SQL database).

I was able to download jars and place them into /dbfs/hive_metastore_jar

My next step is to run cluster with Init file:

# Hive-specific configuration options.
# spark.hadoop prefix is added to make sure these Hive specific options propagate to the metastore client.
# JDBC connect string for a JDBC metastore
spark.hadoop.javax.jdo.option.ConnectionURL jdbc:sqlserver://<host>.database.windows.net:1433;database=<database> #should I add more parameters?

# Username to use against metastore database
spark.hadoop.javax.jdo.option.ConnectionUserName admin

# Password to use against metastore database
spark.hadoop.javax.jdo.option.ConnectionPassword p@ssword

# Driver class name for a JDBC metastore
spark.hadoop.javax.jdo.option.ConnectionDriverName com.microsoft.sqlserver.jdbc.SQLServerDriver

# Spark specific configuration options
spark.sql.hive.metastore.version 2.7.3 #I am not sure about this
# Skip this one if <hive-version> is 0.13.x.
spark.sql.hive.metastore.jars /dbfs/hive_metastore_jar 

I've uploaded ini file to the DBMS and launch cluster. It was failed to read ini. Something wrong.. [1]: https://docs.microsoft.com/en-us/azure/databricks/data/metastores/external-hive-metastore

1
please post error trace. most probably it's the problem with connectivity to SQL server. Check it with %sh nc -vz <host>.database.windows.net 1433 - Alex Ott
Thank you, Alex. Firewall is ok. I posted the detail issues and solution. - Dmitry Anoshin

1 Answers

2
votes

I solved this for now. The problems I faced:

  1. I didn't copy Hive jars to the local cluster. This is important, I couldn't refer to the DBMS and should refer spark.sql.hive.metastore.jars to the local copy of Hive. With INI script I can copy them.
  2. connection was good. I also used the Azure template with Vnet, it is more preferable. Then I allow traffic for Azure SQL from my Vnet with databricks.
  3. last issue - I had to create Hive schema before start databricks by copy and run DDL from Git with Hive version 1.2 I deployed it into Azure SQL Database and then I was good to go.

There is a useful notebook with steps to download jars. It is downloading jars to tmp then we should copy it to the own folder. Finally, within cluster creation we should refer to INI script that has all parameters. It has the step of copy jars from DBFS to local file system of cluster.

// This example is for an init script named `external-metastore_hive121.sh`.
dbutils.fs.put(
    "dbfs:/databricks/scripts/external-metastore_hive121.sh",
    """#!/bin/sh
      |# A temporary workaround to make sure /dbfs is available.
      |sleep 10
      |# Copy metastore jars from DBFS to the local FileSystem of every node.
      |cp -r /dbfs/metastore_jars/hive-v1_2/* /databricks/hive_1_2_1_metastore_jars
      |# Loads environment variables to determine the correct JDBC driver to use.
      |source /etc/environment
      |# Quoting the label (i.e. EOF) with single quotes to disable variable interpolation.
      |cat << 'EOF' > /databricks/driver/conf/00-custom-spark.conf
      |[driver] {
      |    # Hive specific configuration options.
      |    # spark.hadoop prefix is added to make sure these Hive specific options will propagate to the metastore client.
      |    # JDBC connect string for a JDBC metastore
      |    "spark.hadoop.javax.jdo.option.ConnectionURL" = "jdbc:sqlserver://host--name.database.windows.net:1433;database=tcdatabricksmetastore_dev;encrypt=true;trustServerCertificate=false;hostNameInCertificate=*.database.windows.net"
      |
      |    # Username to use against metastore database
      |    "spark.hadoop.javax.jdo.option.ConnectionUserName" = "admin"
      |
      |    # Password to use against metastore database
      |    "spark.hadoop.javax.jdo.option.ConnectionPassword" = "P@ssword"
      |
      |    # Driver class name for a JDBC metastore
      |    "spark.hadoop.javax.jdo.option.ConnectionDriverName" = "com.microsoft.sqlserver.jdbc.SQLServerDriver"
      |
      |    # Spark specific configuration options
      |    "spark.sql.hive.metastore.version" = "1.2.1"
      |    # Skip this one if ${hive-version} is 0.13.x.
      |    "spark.sql.hive.metastore.jars" = "/databricks/hive_1_2_1_metastore_jars/*"
      |}
      |EOF
      |""".stripMargin,
    overwrite = true)

The command will create a file in DBFS and we will use it as a reference for the cluster creation.

According to the documentation, we should use config:

datanucleus.autoCreateSchema true
datanucleus.fixedDatastore false 

In order to create the Hive DDL. It didn't work for me, that's why I used git and create schema and tables myself.

You can test that all works with command:

%sql show databases