2 votes

How can I create an EXTERNAL TABLE in Azure Databricks which reads from Azure Data Lake Store? I am having trouble seeing in the documentation whether it is even possible. I have a set of CSV files in a specific folder in Azure Data Lake Store, and I want to do a CREATE EXTERNAL TABLE in Azure Databricks which points to the CSV files.


2 Answers

3 votes

1. Reference mounted directories

You can mount the Azure Data Lake Store (ADLS) to Azure Databricks DBFS (requires Databricks Runtime 4.0 or higher):

    # Get Azure Data Lake Store credentials from the secret store
    clientid = dbutils.preview.secret.get(scope = "adls", key = "clientid")
    credential = dbutils.preview.secret.get(scope = "adls", key = "credential")
    refreshurl = dbutils.preview.secret.get(scope = "adls", key = "refreshurl")
    accounturl = dbutils.preview.secret.get(scope = "adls", key = "accounturl")

    # Mount the ADLS
    configs = {"dfs.adls.oauth2.access.token.provider.type": "ClientCredential",
       "dfs.adls.oauth2.client.id": clientid,
       "dfs.adls.oauth2.credential": credential,
       "dfs.adls.oauth2.refresh.url": refreshurl}

    dbutils.fs.mount(
       source = accounturl,
       mount_point = "/mnt/adls",
       extra_configs = configs)
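
To sanity-check the mount before creating any tables, you can list the mounted directory (a quick check, assuming the /mnt/adls mount point defined above):

    # List the contents of the mount point to verify the ADLS files are visible
    display(dbutils.fs.ls("/mnt/adls"))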

Table creation works the same way as with DBFS. Just reference the mount point together with the directory in ADLS, e.g.:

    %sql 
    CREATE TABLE product
    USING CSV
    OPTIONS (header "true", inferSchema "true")
    LOCATION "/mnt/adls/productscsv/"

The LOCATION clause automatically implies EXTERNAL. See also the Azure Databricks documentation.
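
If you want to verify that the table really is external (unmanaged), you can inspect its metadata, e.g. for the product table created above:

    # The "Type" row should show EXTERNAL and "Location" should point at the mount
    spark.sql("DESCRIBE TABLE EXTENDED product").show(50, truncate = False)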

2. Reference the Data Lake Store in the table definition directly

You can also reference the storage directly in the table definition without mounting it. This scenario makes sense if the metadata or parts of the code are also used on other platforms. In this scenario, access to the storage has to be defined at the cluster or notebook level (see this Databricks documentation for ADLS Gen1 or this documentation for the Gen2 configuration details), or Azure AD Credential Passthrough is used. The table definition would look like this for ADLS Gen1:

    CREATE TABLE sampletable
    (L_ORDERKEY BIGINT,
     L_PARTKEY BIGINT,
     L_SUPPKEY BIGINT,
     L_SHIPMODE STRING,
     L_COMMENT STRING)
    USING csv
    OPTIONS ('DELIMITER' '|')
    LOCATION "adl://<your adls>.azuredatalakestore.net/directory1/sampletable";
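
Once defined, the table can be queried like any other table, e.g. as a quick smoke test (assuming the cluster or notebook has access to the ADLS path):

    # Count the rows behind the external location
    spark.sql("SELECT COUNT(*) AS row_count FROM sampletable").show()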

For Azure Data Lake Gen2 the location reference looks like:

    LOCATION "abfss://<file_system>@<account_name>.dfs.core.windows.net/directory/tablename"
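
For the notebook-level access mentioned above, the ADLS Gen2 configuration could look roughly like this (a sketch of the OAuth/service-principal approach; the placeholder values are assumptions and the exact config keys should be checked against the linked Databricks documentation):

    # Service principal credentials for the ABFS driver (abfss://), set per storage account
    spark.conf.set("fs.azure.account.auth.type.<account_name>.dfs.core.windows.net", "OAuth")
    spark.conf.set("fs.azure.account.oauth.provider.type.<account_name>.dfs.core.windows.net",
                   "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider")
    spark.conf.set("fs.azure.account.oauth2.client.id.<account_name>.dfs.core.windows.net", "<application_id>")
    spark.conf.set("fs.azure.account.oauth2.client.secret.<account_name>.dfs.core.windows.net", "<client_secret>")
    spark.conf.set("fs.azure.account.oauth2.client.endpoint.<account_name>.dfs.core.windows.net",
                   "https://login.microsoftonline.com/<directory_id>/oauth2/token")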

0 votes

You should consider looking at this link: https://docs.azuredatabricks.net/spark/latest/data-sources/azure/azure-datalake.html

Access Azure Data Lake Store using the Spark API

To read from your Data Lake Store account, you can configure Spark to use service credentials with the following snippet in your notebook:

    spark.conf.set("dfs.adls.oauth2.access.token.provider.type", "ClientCredential")
    spark.conf.set("dfs.adls.oauth2.client.id", "{YOUR SERVICE CLIENT ID}")
    spark.conf.set("dfs.adls.oauth2.credential", "{YOUR SERVICE CREDENTIALS}")
    spark.conf.set("dfs.adls.oauth2.refresh.url", "https://login.microsoftonline.com/{YOUR DIRECTORY ID}/oauth2/token")
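
With those credentials in place, you can read the CSV files directly through the Spark API, e.g. (a minimal sketch; the adl:// path and folder are placeholders):

    # Read the CSV folder straight from ADLS over the adl:// scheme
    df = spark.read.csv("adl://{YOUR ADLS ACCOUNT}.azuredatalakestore.net/{YOUR CSV FOLDER}/",
                        header = True, inferSchema = True)
    df.show(5)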

It doesn't mention the use of External Table.