I am trying to achieve the same functionality as this SO post, Spark dataframe save in single file on hdfs location, except my file is located in Azure Data Lake Gen2 and I am using PySpark in a Databricks notebook.

Below is the code snippet I am using to rename the file:

from py4j.java_gateway import java_import
java_import(spark._jvm, 'org.apache.hadoop.fs.Path')

destpath = "abfss://" + container + "@" + storageacct + ".dfs.core.windows.net/"
fs = spark._jvm.org.apache.hadoop.fs.FileSystem.get(spark._jsc.hadoopConfiguration())
# Find the part file that Spark wrote so it can be renamed
file = fs.globStatus(spark._jvm.Path(destpath + 'part*'))[0].getPath().getName()
# Rename the file

I receive an IndexError: list index out of range on this line:

file = fs.globStatus(spark._jvm.Path(destpath + 'part*'))[0].getPath().getName()

The part* file does exist in the folder.
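
For reference, here is the complete flow I am aiming for, modeled on the linked post; df and the final file name report.csv are illustrative:

# Intended end-to-end flow, modeled on the linked SO post.
# 'df' and the target name 'report.csv' are illustrative.
df.coalesce(1).write.mode('overwrite').csv(destpath)
statuses = fs.globStatus(spark._jvm.Path(destpath + 'part*'))
if statuses:
    src = statuses[0].getPath()
    fs.rename(src, spark._jvm.Path(destpath + 'report.csv'))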

Is this the right approach to rename a file that Databricks (PySpark) writes to Azure Data Lake Gen2? If not, how else can I accomplish this?


1 Answer


I was able to resolve this by installing the azure-storage-file-datalake client library in my Databricks notebook. Using its FileSystemClient and DataLakeFileClient classes, I was able to rename the file in Data Lake Gen2.
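
A minimal sketch of that approach, assuming key-based authentication; the account, key, container, folder, and target file names are all illustrative:

from azure.storage.filedatalake import DataLakeServiceClient

# Illustrative values: substitute your own account, key, container, and folder.
service = DataLakeServiceClient(
    account_url="https://<storageacct>.dfs.core.windows.net",
    credential="<account-key>",
)
fs_client = service.get_file_system_client("<container>")

# Find the part-* file Spark wrote into the output folder.
part_file = next(
    p.name for p in fs_client.get_paths(path="output")
    if p.name.rsplit("/", 1)[-1].startswith("part-")
)

# rename_file expects the new name prefixed with the filesystem (container) name.
file_client = fs_client.get_file_client(part_file)
file_client.rename_file(fs_client.file_system_name + "/output/report.csv")

Note that rename_file is a server-side rename on ADLS Gen2, so the part file is not copied and re-uploaded.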