
We have 500 CSV files uploaded to an Azure storage container. These files use 4 different schemas: each schema has a few columns the others lack, while some columns are common across all files.

We are using ADF with schema drift to map columns between source and sink and write the files.

But this is not working: ADF only uses the schema of the first file it processes and applies it to every file, which is causing data issues. Please advise on this issue.
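
To illustrate what seems to be happening, here is a rough pandas equivalent (the file names and columns are made up for the example): the first file's schema gets locked in and applied positionally to every file that follows.

```python
import pandas as pd

# Hypothetical repro of the symptom, outside ADF.
# fileA.csv header: Description,PayClass,Group
# fileB.csv header: WBSname,Activity Name,Group
first_schema = pd.read_csv("fileA.csv", nrows=0).columns

frames = []
for path in ["fileA.csv", "fileB.csv"]:
    # Every file is forced into the first file's schema, so
    # fileB's WBSname values land under the Description column.
    frames.append(pd.read_csv(path, skiprows=1, header=None, names=first_schema))

print(pd.concat(frames, ignore_index=True))
```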

We ran the pipeline for three scenarios, but the issue is not resolved. The same issue described below occurs in all three cases:

1. Incorrect mapping: the Description and PayClass columns from the A-type schema get mapped to WBSname and Activity Name.

2. A missing column in one of the files also disturbs the mapping: one file does not have Resource Type, which causes Group to be mapped incorrectly to another column.

Case 1: No schema drift at source or sink. An empty dummy file containing all columns was created and uploaded at the source, with a Derived Column using a column pattern.

Case 2: Schema drift at source and sink. A dummy file containing all columns was created and uploaded at the source, with a Derived Column using a column pattern.

Case 3: Schema drift at source, no schema drift at sink. A dummy file containing all columns was created and uploaded at the source, with a Derived Column using a column pattern.


Answer


This is happening because files with different schemas are being read by a single source transformation.

Schema drift automatically handles cases where that source's schema changes across different invocations of your pipeline; it does not handle multiple schemas arriving within a single invocation.

The way to solve this in your case is to use 4 sources, one for each of your CSV schema types. You can then union the results back into a single stream and sink them together at the end.
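
Conceptually, the pattern looks like this pandas sketch (an illustration outside ADF; the glob patterns are hypothetical): each schema group is read with its own header, and the results are unioned by column name.

```python
import glob
import pandas as pd

# One "source" per schema type, then a union by column name.
schema_patterns = ["schema_a_*.csv", "schema_b_*.csv",
                   "schema_c_*.csv", "schema_d_*.csv"]

frames = []
for pattern in schema_patterns:
    for path in glob.glob(pattern):
        # Each file is read with its own header, so columns are
        # matched by name rather than by position.
        frames.append(pd.read_csv(path))

# concat aligns columns by name (a union-by-name); columns missing
# from a given schema come through as NaN.
combined = pd.concat(frames, ignore_index=True, sort=False)
combined.to_csv("combined.csv", index=False)
```

In ADF itself, each group would be its own source transformation with its own wildcard path, feeding a Union transformation set to match columns by name.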

If you use schema drift in this scenario with 4 separate sources, the data flow will automatically handle cases where additional columns appear or columns change from one pipeline execution to the next.

By the way, the schemaMerge feature you're effectively asking for is available today for Parquet sources in ADF data flows. We're working on adding native schemaMerge for CSV sources; until then, you'll need to use an approach like the one described above.