It seems impossible to write to Azure Data Lake Gen2 using Spark unless you're using Databricks.
I'm using Jupyter with almond to run Spark in a notebook locally.
I have imported the Hadoop dependencies:
import $ivy.`org.apache.hadoop:hadoop-azure:2.7.7`
import $ivy.`com.microsoft.azure:azure-storage:8.4.0`
which allows me to use the wasbs:// protocol when trying to write my DataFrame to Azure:
spark.conf.set(
"fs.azure.sas.[container].prodeumipsadatadump.blob.core.windows.net",
"?sv=2018-03-28&ss=b&srt=sco&sp=rwdlac&se=2019-09-09T23:33:45Z&st=2019-09-09T15:33:45Z&spr=https&sig=[truncated]")
This is the code that triggers the error:
val data = spark.read.json(spark.createDataset(
"""{"name":"Yin", "age": 25.35,"address":{"city":"Columbus","state":"Ohio"}}""" :: Nil))
data
.write
.orc("wasbs://[filesystem]@[datalakegen2storageaccount].blob.core.windows.net/lalalalala")
We are then greeted with the "Blob API is not yet supported for hierarchical namespace accounts" error:
org.apache.hadoop.fs.azure.AzureException: com.microsoft.azure.storage.StorageException: Blob API is not yet supported for hierarchical namespace accounts.
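For debugging, you can at least ask Hadoop which FileSystem implementation backs a scheme; a quick sketch, reusing the placeholder names from above:

import org.apache.hadoop.fs.FileSystem
import java.net.URI

// Resolve the FileSystem behind wasbs://; with hadoop-azure on the
// classpath this should print org.apache.hadoop.fs.azure.NativeAzureFileSystem.
val fs = FileSystem.get(
  new URI("wasbs://[filesystem]@[datalakegen2storageaccount].blob.core.windows.net/"),
  spark.sparkContext.hadoopConfiguration)
println(fs.getClass.getName)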
So is this indeed impossible? Should I just abandon Data Lake Gen2 and use regular Blob Storage instead? Microsoft really dropped the ball here: they created a "Data Lake" product but wrote no documentation for a Spark connector.
Trying the abfs:// scheme instead fails with No FileSystem for scheme: abfs, which means abfs isn't included in Hadoop 2.7 :( – Moriarty Snarly
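Following up on that comment, here is a minimal sketch of what I'd expect the ABFS route to look like on a newer Hadoop (assuming hadoop-azure 3.2.0, which is where the ABFS driver first shipped, a classpath compatible with the Hadoop 3.x client, and placeholder account name and key):

import $ivy.`org.apache.hadoop:hadoop-azure:3.2.0`

// ABFS talks to the Gen2 (dfs) endpoint rather than the blob endpoint;
// all names and the key below are placeholders.
spark.conf.set(
  "fs.azure.account.key.[datalakegen2storageaccount].dfs.core.windows.net",
  "[account-key]")

data
  .write
  .orc("abfss://[filesystem]@[datalakegen2storageaccount].dfs.core.windows.net/lalalalala")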