2 votes

I have complex XML files with nested elements. I built a process to handle them using SSIS and T-SQL. We use Azure Data Factory, and I'd like to explore converting the XML files to JSON or CSV, since those formats are supported by ADF and XML is not.

It appears Logic Apps is one option. Has anyone had luck with other approaches to converting XML within a pipeline?

Current workflow: pick up XML files from a folder, drop them on network drives, bulk insert the XML into a staging table, and parse the XML out to various SQL tables for analysis.

Sample:

<HEADER>
  <SURVEYID>1234</SURVEYID>
  <RESPONSES>
      <VAR>Question1</VAR>
      <VALUE>Answer1</VALUE>
  </RESPONSES>
  <RESPONSES>
      <VAR>Question2</VAR>
      <VALUE>Answer2</VALUE>
  </RESPONSES>
  <SURVEYID>1234</SURVEYID>
  <RESPONSES>
      <VAR>Question1</VAR>
      <VALUE>DifferentAnswer</VALUE>
  </RESPONSES>
</HEADER>

Note: I don't need to know how to parse XML; that part is done. I also know that you can execute SSIS within ADF. I am looking for alternatives to the overall process.

1
Yes, Azure Logic Apps and its Transform XML action is an option. It supports Liquid, XSLT 1.0, XSLT 2.0, and XSLT 3.0. XSLT 3.0 can emit JSON directly (output method="json"). Amazingly enough, Microsoft Azure Data Factory cannot process XML. It is a glaring omission. I know that they are working to introduce such functionality. – Yitzhak Khabinsky

1 Answer

5 votes

I'm not sure why this question got downvoted - I had a similar need a few months ago. The situation was exacerbated by the fact that the XML we receive is poorly formatted and wouldn't even parse correctly. To solve this, I wrote a .NET console app and deployed it to Azure Batch. It reads the XML from Blob Storage, corrects the formatting errors, then parses the XML and outputs it to a JSON file back in Blob Storage. ADF supports Azure Batch through the "Custom" activity, so this plugs right into our pipeline. Depending on your data structure, you could output CSV instead if that is more appropriate.
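
For the conversion step itself, here is a minimal sketch assuming Newtonsoft.Json (which provides JsonConvert.SerializeXmlNode); the blob download/upload and format-correction steps are elided, and the file names are hypothetical:

using System.IO;
using System.Xml;
using Newtonsoft.Json;

// Load the (already corrected) XML; "survey.xml" stands in for the blob content
var doc = new XmlDocument();
doc.LoadXml(File.ReadAllText("survey.xml"));

// Convert the whole document to an indented JSON string and write it out
string json = JsonConvert.SerializeXmlNode(doc, Newtonsoft.Json.Formatting.Indented);
File.WriteAllText("survey.json", json);

A nice side effect is that SerializeXmlNode turns repeated sibling elements, like the RESPONSES blocks in your sample, into a JSON array, so the survey structure survives the conversion.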

The tricky bits of using Azure Batch from ADF are in passing and processing parameter data. In the ADF configuration, these are listed under "Extended properties" on the Custom activity.
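
For reference, a rough sketch of how the extended properties sit in the Custom activity definition; the activity name and command are illustrative, and the property names mirror the code below:

{
    "name": "ConvertXml",
    "type": "Custom",
    "typeProperties": {
        "command": "XmlConverter.exe",
        "extendedProperties": {
            "AlertId": "...",
            "HashKey": "...",
            "StartTime": "...",
            "EndTime": "..."
        }
    }
}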

These properties are available to the Batch job at runtime in a JSON file named "activity.json". In the console app, you read that file to access the extended properties:

using System.IO;
using System.Collections.Generic;
using Newtonsoft.Json;

// ADF drops the activity definition into the task's working directory
var activity_json = File.ReadAllText("activity.json");
dynamic activity = JsonConvert.DeserializeObject(activity_json);

// Copy each extended property into a dictionary; property names are case sensitive
var parameters = new Dictionary<string, string>();
parameters.Add("alertId", activity.typeProperties.extendedProperties.AlertId.ToString());
parameters.Add("hashKey", activity.typeProperties.extendedProperties.HashKey.ToString());
parameters.Add("startTime", activity.typeProperties.extendedProperties.StartTime.ToString());
parameters.Add("endTime", activity.typeProperties.extendedProperties.EndTime.ToString());

The property names are case sensitive. [Note that in this example I am writing them to a "parameters" Dictionary; I do that so I can run the console app either locally or in Azure Batch.] There are a few other "interesting" aspects to using Azure Batch, but this is the biggest hurdle in my opinion.
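
A rough sketch of how that local/Batch dual mode might look, assuming "key=value" command-line arguments stand in for the extended properties when running locally (that convention, and the helper name, are illustrative):

using System.IO;
using System.Collections.Generic;
using Newtonsoft.Json;

static Dictionary<string, string> LoadParameters(string[] args)
{
    var parameters = new Dictionary<string, string>();
    if (File.Exists("activity.json"))
    {
        // Running in Azure Batch: read the extended properties ADF passed in
        dynamic activity = JsonConvert.DeserializeObject(File.ReadAllText("activity.json"));
        parameters.Add("alertId", activity.typeProperties.extendedProperties.AlertId.ToString());
        // ... remaining properties as above
    }
    else
    {
        // Running locally: take "key=value" pairs from the command line
        foreach (var arg in args)
        {
            var parts = arg.Split(new[] { '=' }, 2);
            parameters.Add(parts[0], parts[1]);
        }
    }
    return parameters;
}

The rest of the app then only ever reads from the dictionary, so it doesn't need to know which environment it is running in.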