How can I create an EXTERNAL TABLE in Azure Databricks that reads from Azure Data Lake Store? I can't tell from the documentation whether this is even possible. I have a set of CSV files in a specific folder in Azure Data Lake Store, and I want to run a CREATE EXTERNAL TABLE in Azure Databricks that points to those CSV files.
2 Answers
1. Reference mounted directories
You can mount Azure Data Lake Store (ADLS) to Azure Databricks DBFS (requires Databricks Runtime 4.0 or higher):
# Get Azure Data Lake Store credentials from the secret store
clientid = dbutils.secrets.get(scope = "adls", key = "clientid")
credential = dbutils.secrets.get(scope = "adls", key = "credential")
refreshurl = dbutils.secrets.get(scope = "adls", key = "refreshurl")
accounturl = dbutils.secrets.get(scope = "adls", key = "accounturl")

# OAuth settings for the ADLS service principal
configs = {"dfs.adls.oauth2.access.token.provider.type": "ClientCredential",
           "dfs.adls.oauth2.client.id": clientid,
           "dfs.adls.oauth2.credential": credential,
           "dfs.adls.oauth2.refresh.url": refreshurl}

# Mount the ADLS account (e.g. adl://<account>.azuredatalakestore.net/) under /mnt/adls
dbutils.fs.mount(
    source = accounturl,
    mount_point = "/mnt/adls",
    extra_configs = configs)
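Once the mount is in place you can verify it, e.g. by listing the mounted directory (a minimal sketch; /mnt/adls is the mount point from above):
# List the mounted ADLS directory to confirm the mount worked
display(dbutils.fs.ls("/mnt/adls"))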
Table creation then works the same way as with DBFS. Just reference the mount point plus the directory in ADLS, e.g.:
%sql
CREATE TABLE product
USING CSV
OPTIONS (header "true", inferSchema "true")
LOCATION "/mnt/adls/productscsv/"
The LOCATION clause automatically makes the table external. See also the Azure Databricks documentation.
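If you want to double-check that the table really is external (and that a DROP TABLE will therefore leave the files in ADLS untouched), you can inspect the table metadata; a minimal sketch using the product table from the example above:
# The "Type" row in the extended metadata should read EXTERNAL
spark.sql("DESCRIBE TABLE EXTENDED product").show(100, truncate=False)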
2. Reference the Data Lake Store in the table definition directly
You can also reference the storage directly in the table definition without mounting it. This scenario makes sense if the metadata or parts of the code are also used on other platforms. In this case access to the storage has to be defined at the cluster or notebook level (see this Databricks documentation for ADLS Gen1 or this documentation for Gen2 configuration details), or Azure AD Credential Passthrough is used. The table definition would look like this for ADLS Gen1:
CREATE TABLE sampletable
(L_ORDERKEY BIGINT,
L_PARTKEY BIGINT,
L_SUPPKEY BIGINT,
L_SHIPMODE STRING,
L_COMMENT STRING)
USING csv
OPTIONS ('DELIMITER' '|')
LOCATION "adl://<your adls>.azuredatalakestore.net/directory1/sampletable"
;
For Azure Data Lake Gen2 the location reference looks like:
LOCATION "abfss://<file_system>@<account_name.dfs.core.windows.net/directory/tablename"
You should consider looking at this link: https://docs.azuredatabricks.net/spark/latest/data-sources/azure/azure-datalake.html
Access Azure Data Lake Store using the Spark API
To read from your Data Lake Store account, you can configure Spark to use service credentials with the following snippet in your notebook:
spark.conf.set("dfs.adls.oauth2.access.token.provider.type", "ClientCredential") spark.conf.set("dfs.adls.oauth2.client.id", "{YOUR SERVICE CLIENT ID}") spark.conf.set("dfs.adls.oauth2.credential", "{YOUR SERVICE CREDENTIALS}") spark.conf.set("dfs.adls.oauth2.refresh.url", "https://login.microsoftonline.com/{YOUR DIRECTORY ID}/oauth2/token")
It doesn't mention the use of external tables, though.