
What is the best method to unzip files in Azure Data Lake Storage Gen1 without moving them to the Azure Databricks file system? We currently use Azure Databricks for compute and ADLS for storage, and we have a restriction against moving the data into DBFS.

We have already mounted ADLS in DBFS but are not sure how to proceed.


1 Answer


Unfortunately, zip files are not supported in Databricks, because Hadoop does not support zip as a compression codec. While text files in GZip, BZip2, and other supported compression formats can be decompressed automatically by Spark as long as they have the right file extension, you must take additional steps to read zip files. The sample in the Databricks documentation performs the unzip on the driver node, calling the OS-level unzip utility (Ubuntu).
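Since you already have ADLS mounted, the same approach can work without copying anything into DBFS itself: the driver node sees the mount through the local FUSE path under /dbfs, so you can unzip in place and write the extracted files straight back to ADLS. A minimal sketch, assuming a notebook context (where spark is predefined) and hypothetical paths such as /mnt/adls/raw/archive.zip — adjust these to your layout:

import zipfile

# Hypothetical mount point: ADLS Gen1 mounted at /mnt/adls, which the
# driver node sees under the local FUSE path /dbfs/mnt/adls. The data
# stays in ADLS; the mount only exposes it.
zip_path = "/dbfs/mnt/adls/raw/archive.zip"
out_dir = "/dbfs/mnt/adls/raw/extracted"

# Extraction runs on the driver node only, so this suits moderate
# archive sizes, not huge ones.
with zipfile.ZipFile(zip_path) as zf:
    zf.extractall(out_dir)

# The extracted files are now regular files in ADLS, readable by Spark.
df = spark.read.csv("/mnt/adls/raw/extracted", header=True)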

If your data source can't provide the data in a compression codec supported by Spark, the best method is an Azure Data Factory copy activity. Azure Data Factory supports more compression codecs, including zip (as the ZipDeflate type).

The type properties definition for the source dataset would look like this:

"typeProperties": {
        "compression": {
            "type": "ZipDeflate",
            "level": "Optimal"
        },
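During the copy, Azure Data Factory decompresses the zip as it reads from the source; leave the compression block off the sink dataset and the extracted files are written back to ADLS uncompressed, so the data never has to pass through DBFS.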

You can also use Azure Data Factory to orchestrate your Databricks pipelines with the Databricks activities, for example running a notebook step after the copy activity has unzipped the files.