It seems impossible to write to Azure Data Lake Gen2 using Spark unless you're using Databricks.
I'm using Jupyter with almond to run Spark in a notebook locally.
I have imported the Hadoop dependencies:
import $ivy.`org.apache.hadoop:hadoop-azure:2.7.7`
import $ivy.`com.microsoft.azure:azure-storage:8.4.0`
which allows me to use the wasbs:// protocol when trying to write my DataFrame to Azure:
spark.conf.set(
"fs.azure.sas.[container].prodeumipsadatadump.blob.core.windows.net",
"?sv=2018-03-28&ss=b&srt=sco&sp=rwdlac&se=2019-09-09T23:33:45Z&st=2019-09-09T15:33:45Z&spr=https&sig=[truncated]")
This is the code that triggers the error:
val data = spark.read.json(spark.createDataset(
"""{"name":"Yin", "age": 25.35,"address":{"city":"Columbus","state":"Ohio"}}""" :: Nil))
data
.write
.orc("wasbs://[filesystem]@[datalakegen2storageaccount].blob.core.windows.net/lalalalala")
We are then greeted with the "Blob API is not yet supported for hierarchical namespace accounts" error:
org.apache.hadoop.fs.azure.AzureException: com.microsoft.azure.storage.StorageException: Blob API is not yet supported for hierarchical namespace accounts.
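For debugging, you can at least ask Hadoop which FileSystem implementation backs a scheme; a quick sketch, reusing the placeholder names from above:

import org.apache.hadoop.fs.FileSystem
import java.net.URI

// Resolve the FileSystem behind wasbs://; with hadoop-azure on the
// classpath this should print org.apache.hadoop.fs.azure.NativeAzureFileSystem.
val fs = FileSystem.get(
  new URI("wasbs://[filesystem]@[datalakegen2storageaccount].blob.core.windows.net/"),
  spark.sparkContext.hadoopConfiguration)
println(fs.getClass.getName)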
So is this indeed impossible? Should I just abandon Data Lake Gen2 and use regular Blob Storage instead? Microsoft really dropped the ball here: they created a "Data Lake" product but wrote no documentation for a Spark connector.
Trying the abfs:// scheme instead fails with No FileSystem for scheme: abfs, which means abfs isn't included in Hadoop 2.7 :( – Moriarty Snarly
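Following up on that comment, here is a minimal sketch of what I'd expect the ABFS route to look like on a newer Hadoop (assuming hadoop-azure 3.2.0, which is where the ABFS driver first shipped, a classpath compatible with the Hadoop 3.x client, and placeholder account name and key):

import $ivy.`org.apache.hadoop:hadoop-azure:3.2.0`

// ABFS talks to the Gen2 (dfs) endpoint rather than the blob endpoint;
// all names and the key below are placeholders.
spark.conf.set(
  "fs.azure.account.key.[datalakegen2storageaccount].dfs.core.windows.net",
  "[account-key]")

data
  .write
  .orc("abfss://[filesystem]@[datalakegen2storageaccount].dfs.core.windows.net/lalalalala")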