Azure Data Factory deflate without creating a folder

Question

I have a Data Factory v2 job which copies files from an SFTP server to an Azure Data Lake Gen2.

There is a mix of .csv files and .zip files (each containing only one csv file).

I have one dataset for copying the csv files and another for copying zip files (with Compression type set to ZipDeflate). The problem is that ZipDeflate creates a new folder that contains the csv file, and I need the copy to respect the existing folder hierarchy without creating any extra folders.

Is this possible in Azure Data Factory?

Alex KeySmith · Accepted Answer · 2019-06-04T22:01:02

Good question, I ran into similar trouble* and it doesn't seem to be well documented.

If I remember correctly Data Factory assumes ZipDeflate could contain more than one file and appears to create a folder no matter what.

If, on the other hand, you have gzip files, which can only hold a single file, then it will create just that file.

You'll probably already know this bit, but keeping it at the forefront of my mind helped me realise why Data Factory's default is a sensible one:

My understanding is that Zip is an archive format that happens to use the Deflate algorithm. Being an archive format, it can naturally contain multiple files.

Whereas gzip (for example) is just a compression format. It doesn't support multiple files (unless they are tar archived first), so it decompresses to a single file without a folder.
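The difference is easy to see outside Data Factory. Here is a minimal sketch using only Python's standard library; the file names and contents are made up for illustration, and it says nothing about how Data Factory is implemented internally. A zip archive exposes a list of named members, while gzip just gives back one stream of bytes.

```python
import gzip
import io
import zipfile


def make_zip(files: dict[str, bytes]) -> bytes:
    """Build an in-memory zip archive (Deflate) from name -> content pairs."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", compression=zipfile.ZIP_DEFLATED) as archive:
        for name, content in files.items():
            archive.writestr(name, content)
    return buffer.getvalue()


def zip_member_names(data: bytes) -> list[str]:
    """Return the names of every member stored in a zip archive."""
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        return archive.namelist()


# Hypothetical sample data, not taken from the question.
zip_bytes = make_zip({"a.csv": b"id\n1\n", "b.csv": b"id\n2\n"})
gzip_bytes = gzip.compress(b"id\n1\n")

# A zip is a container: it has a directory of named entries.
names = zip_member_names(zip_bytes)

# Gzip has no directory: decompressing yields a single payload.
payload = gzip.decompress(gzip_bytes)
```

Because a zip can always hold more than one entry, a tool that unpacks it needs somewhere to put them all, which is presumably why a folder gets created even when there is only one csv inside.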

You could have an additional data factory step to take the hierarchy and copy it to a flat folder perhaps, but that leads to random file names (which you may or may not be happy with). For us it didn't work as our next step in the pipeline needed predictable filenames.
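To show what a flattening step with predictable names could look like, here is a rough sketch that works on a local folder rather than on the data lake, so it is only an analogy for a post-copy step and not a Data Factory feature. It assumes each extracted folder holds exactly one file and names the flattened file after its folder; the function name and the sample folder names are hypothetical.

```python
import shutil
import tempfile
from pathlib import Path


def flatten_single_file_folders(root: Path) -> list[str]:
    """Move the only file out of each subfolder of root, then drop the folder.

    The flattened file is named after its folder, so names are predictable.
    Folders that do not contain exactly one file are left untouched.
    """
    moved = []
    for folder in sorted(p for p in root.iterdir() if p.is_dir()):
        entries = list(folder.iterdir())
        if len(entries) != 1 or not entries[0].is_file():
            continue
        source = entries[0]
        target = root / f"{folder.name}{source.suffix}"
        if target.exists():
            continue  # never overwrite an existing file
        shutil.move(str(source), str(target))
        folder.rmdir()
        moved.append(target.name)
    return moved


# Demo on a throwaway directory with made-up names.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    for folder_name, file_name in [("sales_2019", "random123.csv"), ("stock_2019", "xyz.csv")]:
        (root / folder_name).mkdir()
        (root / folder_name / file_name).write_text("id\n1\n")
    moved = flatten_single_file_folders(root)
    remaining_dirs = [p.name for p in root.iterdir() if p.is_dir()]
```

Against a data lake the same idea would need the storage API instead of local file operations, but the naming rule carries over.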

n.b. Data Factory does not move files, it copies them, so if they're very large this could be a pain. You can, however, trigger a metadata move operation via the Data Lake Store API, PowerShell, etc.

*Mine was a slightly crazier situation, in that I was receiving files named .gz from a source system that were in fact zip files in disguise! In the end the best option was to ask our source system to switch to true gzip files.
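If you suspect the same kind of mislabelling, the first few bytes of a file give it away regardless of its extension. This is a small sketch, assuming you can read the start of the file locally; the function name is hypothetical. The signatures are the standard ones: gzip streams start with the bytes 1f 8b, and zip files start with "PK" followed by 03 04 (or 05 06 for an empty archive, 07 08 for a spanned one).

```python
import gzip
import io
import zipfile

GZIP_MAGIC = b"\x1f\x8b"
ZIP_MAGICS = (b"PK\x03\x04", b"PK\x05\x06", b"PK\x07\x08")


def sniff_compression(header: bytes) -> str:
    """Guess the real format from the leading bytes, ignoring the file name."""
    if header.startswith(GZIP_MAGIC):
        return "gzip"
    if header.startswith(ZIP_MAGICS):
        return "zip"
    return "unknown"


# Build one real gzip stream and one real zip archive in memory to compare.
real_gzip = gzip.compress(b"id\n1\n")

zip_buffer = io.BytesIO()
with zipfile.ZipFile(zip_buffer, "w") as archive:
    archive.writestr("data.csv", b"id\n1\n")
disguised_zip = zip_buffer.getvalue()  # imagine this arrived named "data.gz"

gzip_kind = sniff_compression(real_gzip[:4])
zip_kind = sniff_compression(disguised_zip[:4])
plain_kind = sniff_compression(b"id\n1\n")
```

For a file on disk you would pass in the first four bytes, for example `open(path, "rb").read(4)`.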
