1
votes

I am using Azure Data Lake Store (ADLS), targeted by an Azure Data Factory (ADF) pipeline that reads from Blob Storage and writes into ADLS. During execution I notice that a folder is created in the output ADLS that does not exist in the source data. The folder has a GUID for a name and contains many files, also named with GUIDs. The folder is temporary and disappears after around 30 seconds.

Is this part of the ADLS metadata indexing? Is it something used by ADF during processing? It appears in the Data Explorer in the portal, but does it also show up through the API? I am concerned it may create issues down the line, even though it is a temporary structure.

Any insight appreciated - a Google search turned up little.

Picture of the transient folder


1 Answer

1
votes

So what you're seeing here is something that Azure Data Lake Store does regardless of the method you use to upload and copy data into it. It's not specific to Data Factory and not something you can control.

For large files it basically parallelises the read/write operation for a single file. You then get multiple smaller files appearing in the temporary directory, one for each thread of the parallel operation. Once complete, the process concatenates those pieces into the single expected destination file and removes the temporary directory.
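To make the pattern concrete, here is a minimal local sketch of a parallel chunked copy. It is not the ADLS implementation and does not call any Azure API; the function name `chunked_copy`, the chunk size, and the part-file naming are all assumptions made for illustration. It only mimics the behaviour described above on a local filesystem: parts are written in parallel into a GUID-named folder next to the target, then concatenated, then the folder is deleted.

```python
import os
import shutil
import uuid
from concurrent.futures import ThreadPoolExecutor


def chunked_copy(src, dst, chunk_size=4 * 1024 * 1024, workers=4):
    """Copy src to dst via parallel part files in a GUID-named temp folder.

    Hypothetical helper for illustration only; returns the number of parts.
    """
    size = os.path.getsize(src)
    n_chunks = (size + chunk_size - 1) // chunk_size
    # Temp folder lives beside the destination, like the transient folder seen in ADLS.
    temp_dir = os.path.join(os.path.dirname(os.path.abspath(dst)), str(uuid.uuid4()))
    os.makedirs(temp_dir)

    def write_chunk(index):
        # Each worker reads its own byte range and writes its own part file.
        with open(src, "rb") as f:
            f.seek(index * chunk_size)
            data = f.read(chunk_size)
        part = os.path.join(temp_dir, str(uuid.uuid4()))
        with open(part, "wb") as out:
            out.write(data)
        return part

    try:
        with ThreadPoolExecutor(max_workers=workers) as pool:
            # pool.map yields results in input order, so parts stay in sequence.
            parts = list(pool.map(write_chunk, range(n_chunks)))
        with open(dst, "wb") as out:
            for part in parts:
                with open(part, "rb") as f:
                    shutil.copyfileobj(f, out)
    finally:
        # Clean up even on failure; skipping this step leaves orphaned part files.
        shutil.rmtree(temp_dir, ignore_errors=True)
    return n_chunks
```

In this sketch the `finally` block is what makes the folder vanish. A process that dies before reaching its cleanup step would leave the GUID folder and its parts behind.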

Comparison: this is similar to what PolyBase does in SQLDW with its 8 external readers that hit a file in 512MB blocks.

I understand your concerns here. I've also done battle with this, where the operation fails and does not clean up the temp files. My advice would be to be explicit with your downstream services when specifying the target file path, so they never pick up the temporary folder by accident.
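If a downstream job has to work from a folder listing rather than one explicit file path, one defensive option is to skip anything that sits under a GUID-named segment. The sketch below is only that: `stable_paths` is a hypothetical helper, the paths are made up, and it assumes the temporary names use the standard hyphenated 8-4-4-4-12 GUID format, which the question does not confirm.

```python
import re

# Assumed format: standard hyphenated GUID, e.g. 3f2504e0-4f89-11d3-9a0c-0305e82c3301
GUID_RE = re.compile(r"[0-9a-fA-F]{8}(-[0-9a-fA-F]{4}){3}-[0-9a-fA-F]{12}")


def stable_paths(paths):
    """Return only the paths that contain no GUID-named segment."""
    return [
        p for p in paths
        if not any(GUID_RE.fullmatch(segment) for segment in p.split("/"))
    ]


# Hypothetical listing: one real output file plus a transient part file.
listing = [
    "/output/2017/data.csv",
    "/output/3f2504e0-4f89-11d3-9a0c-0305e82c3301/0a1b2c3d-4f89-11d3-9a0c-0305e82c3301",
]
print(stable_paths(listing))
```

This would also drop legitimate files that happen to be named with GUIDs, so pointing the downstream service at an exact file path remains the simpler choice where possible.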

One other thing: I've had problems using the Visual Studio Data Lake file explorer tool to upload large files. Sometimes the parallel threads did not concatenate into the single file correctly, which corrupted my structured dataset. This was with files in the 4 - 8GB region. Be warned!

Side note: I've found PowerShell the most reliable option for handling uploads into Data Lake Store.

Hope this helps.