
Derived column pattern

Objective: We have two sample datasets with different schema formats. Our objective is to load the different schema format files present in the source with a single pipeline, as it is not possible for us to segregate the source files based on their schemas. The data is loaded into a database with the structure shown below. We have created a pipeline in Azure Data Factory that connects to the source and loads all the csv files present in the source using a derived column transformation. Schema drift is enabled on both the source and the sink, and a column pattern is used in the derived column transformation.

Issue: The pipeline works fine and maps the target columns correctly if only A type or only B type files are present in the source, but if both file types are mixed then the mapping on the target table does not happen correctly. Schema drift is not working when the source has two different file schemas mixed together. The schema formats and the table structure are given below.

Schema A Format : Filename,Cost_Type,ResourceType,Group,Subgroup,Description,Pay Class,ResourceName,Date,Hours

Schema B Format : Filename,Cost_Type,Resource Type,Group,Sub-Group,WBS Name,Activity Name,Resource Name,Date,Hours

So the SQL table has all the columns of both A type and B type? – Joseph Xu
Yes, the SQL table will have all the columns of A and B, like this: – Charu Thareja
[Filename] [nvarchar](max) NULL, [CostType] [nvarchar](max) NULL, [ResourceType] [nvarchar](max) NULL, [Group] [nvarchar](max) NULL, [Subgroup] [nvarchar](max) NULL, [Description] [nvarchar](max) NULL, [PayClass] [nvarchar](max) NULL, [ResourceName] [nvarchar](max) NULL, [Date] [nvarchar](max) NULL, [Hours] FLOAT NULL, [WBSName] [nvarchar](max) NULL, [Activity Name] [nvarchar](max) NULL – Charu Thareja

2 Answers


Schema drift protects your ETL by automatically handling changes to your source metadata. https://docs.microsoft.com/en-us/azure/data-factory/concepts-data-flow-schema-drift

Your use case is different. You are trying to define a single source that has multiple different schema definitions. If you were processing Parquet source files, we could perform an implicit schemaMerge. But since you are processing CSV sources, you need an alternative approach.

How many files are you processing on each pipeline execution? Can you instead send the files into the data flow activity one file at a time from a pipeline ForEach?

That approach works because the schema will be discovered on the fly by the data flow on each invocation. When using ForEach with a data flow activity, it is recommended that you minimize parallelization and use sequential execution, which limits the scalability of this approach.
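To make that concrete, here is a minimal sketch of that layout as ADF pipeline JSON, assuming a folder dataset named SourceFolder, a data flow named DeriveAndLoad, and activity names chosen for illustration; how the current file name is passed into the data flow source (for example through a parameterized source dataset) is omitted:

```json
{
    "name": "ProcessFilesOneAtATime",
    "properties": {
        "activities": [
            {
                "name": "GetFileList",
                "type": "GetMetadata",
                "typeProperties": {
                    "dataset": { "referenceName": "SourceFolder", "type": "DatasetReference" },
                    "fieldList": [ "childItems" ]
                }
            },
            {
                "name": "ForEachFile",
                "type": "ForEach",
                "dependsOn": [ { "activity": "GetFileList", "dependencyConditions": [ "Succeeded" ] } ],
                "typeProperties": {
                    "isSequential": true,
                    "items": { "value": "@activity('GetFileList').output.childItems", "type": "Expression" },
                    "activities": [
                        {
                            "name": "ProcessOneFile",
                            "type": "ExecuteDataFlow",
                            "typeProperties": {
                                "dataFlow": { "referenceName": "DeriveAndLoad", "type": "DataFlowReference" }
                            }
                        }
                    ]
                }
            }
        ]
    }
}
```

Setting isSequential to true follows the recommendation above; the trade-off is that the files are processed one after another rather than in parallel.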


What @Mark Kromer MSFT said is correct. We can import data from different schema format files via a ForEach activity. According to your screenshot, I guess you are using csv files as the data source.

Based on the above situation, I think we don't need to use a data flow. I did an experiment as follows:
I created two csv files in the Azure Data Lake Gen2 container test5 and a table in Azure SQL in your format.
Finally, I imported the two different schema csv files into the SQL table.

In ADF (JSON sketches of the main pieces are included after the steps):

  1. We can use Child Items at the Get Metadata1 activity to get the file list.

  2. Then we can traverse the file list at the ForEach1 activity.

  3. In the ForEach1 activity, we can use Get Metadata2 to get the file structure (the column names). The dataset used here is DelimitedText45, and we will use it in the next steps (see the dataset sketch after these steps).
    On the file path option of the source data, we need to key in @item().name.

  4. In the If Condition activity, we can use @equals(activity('Get Metadata2').output.structure[5].name,'Description') to determine the type of file, since the sixth column is Description only in Schema A files.

  5. In the true condition, I defined Copy activity1 and created a new dataset for the test5 container.

    We also need to import the schemas and set the mapping (an example mapping is sketched after the steps).

  6. In the false condition, the steps are the same. I defined Copy activity2. We can use the same dataset, but we need to import the schemas and set the mapping again.

  7. Then we can debug the pipeline, and the data will be correctly imported into the SQL table.
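For step 3, a slightly cleaner variant of keying @item().name into the file path is to parameterize the dataset and pass the file name in from the pipeline. Here is a minimal sketch of what DelimitedText45 could look like with a fileName parameter; the linked service name AzureDataLakeStorage1 is a placeholder:

```json
{
    "name": "DelimitedText45",
    "properties": {
        "linkedServiceName": { "referenceName": "AzureDataLakeStorage1", "type": "LinkedServiceReference" },
        "parameters": { "fileName": { "type": "string" } },
        "type": "DelimitedText",
        "typeProperties": {
            "location": {
                "type": "AzureBlobFSLocation",
                "fileSystem": "test5",
                "fileName": { "value": "@dataset().fileName", "type": "Expression" }
            },
            "columnDelimiter": ",",
            "firstRowAsHeader": true
        }
    }
}
```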
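Steps 1-4 then correspond to a control flow roughly like the pipeline sketch below. The Copy activities are left as placeholders, a folder dataset named SourceFolder pointing at the test5 container is assumed, and Get Metadata2 passes the current file name to the parameterized DelimitedText45 dataset:

```json
{
    "name": "LoadMixedSchemaCsv",
    "properties": {
        "activities": [
            {
                "name": "Get Metadata1",
                "type": "GetMetadata",
                "typeProperties": {
                    "dataset": { "referenceName": "SourceFolder", "type": "DatasetReference" },
                    "fieldList": [ "childItems" ]
                }
            },
            {
                "name": "ForEach1",
                "type": "ForEach",
                "dependsOn": [ { "activity": "Get Metadata1", "dependencyConditions": [ "Succeeded" ] } ],
                "typeProperties": {
                    "isSequential": true,
                    "items": { "value": "@activity('Get Metadata1').output.childItems", "type": "Expression" },
                    "activities": [
                        {
                            "name": "Get Metadata2",
                            "type": "GetMetadata",
                            "typeProperties": {
                                "dataset": {
                                    "referenceName": "DelimitedText45",
                                    "type": "DatasetReference",
                                    "parameters": { "fileName": "@item().name" }
                                },
                                "fieldList": [ "structure" ]
                            }
                        },
                        {
                            "name": "If Condition1",
                            "type": "IfCondition",
                            "dependsOn": [ { "activity": "Get Metadata2", "dependencyConditions": [ "Succeeded" ] } ],
                            "typeProperties": {
                                "expression": {
                                    "value": "@equals(activity('Get Metadata2').output.structure[5].name, 'Description')",
                                    "type": "Expression"
                                },
                                "ifTrueActivities": [ { "name": "Copy activity1", "type": "Copy" } ],
                                "ifFalseActivities": [ { "name": "Copy activity2", "type": "Copy" } ]
                            }
                        }
                    ]
                }
            }
        ]
    }
}
```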
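Finally, the "import schemas and set the mapping" part of steps 5 and 6 corresponds to an explicit column mapping on each Copy activity. As a rough sketch, the translator section inside Copy activity1's typeProperties (Schema A files) could look like this, with the source names taken from Schema A and the sink names from the SQL table; Copy activity2 would map Resource Type, Sub-Group, WBS Name, Activity Name and Resource Name to ResourceType, Subgroup, WBSName, [Activity Name] and ResourceName in the same way:

```json
"translator": {
    "type": "TabularTranslator",
    "mappings": [
        { "source": { "name": "Filename" },     "sink": { "name": "Filename" } },
        { "source": { "name": "Cost_Type" },    "sink": { "name": "CostType" } },
        { "source": { "name": "ResourceType" }, "sink": { "name": "ResourceType" } },
        { "source": { "name": "Group" },        "sink": { "name": "Group" } },
        { "source": { "name": "Subgroup" },     "sink": { "name": "Subgroup" } },
        { "source": { "name": "Description" },  "sink": { "name": "Description" } },
        { "source": { "name": "Pay Class" },    "sink": { "name": "PayClass" } },
        { "source": { "name": "ResourceName" }, "sink": { "name": "ResourceName" } },
        { "source": { "name": "Date" },         "sink": { "name": "Date" } },
        { "source": { "name": "Hours" },        "sink": { "name": "Hours" } }
    ]
}
```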