2 votes

I have an ADF pipeline loading raw log data as JSON files into a Data Lake Gen 2 container.

We now want to extract information from those JSON files, and I am trying to find the best way to do that. I found that Azure Data Lake Analytics and U-SQL scripts are powerful and also cheap, but they have a steep learning curve.

Is there a recommended way to parse the JSON files and extract information from them? Would Data Lake tables be adequate storage for this extracted information, which would then act as a source for downstream reporting processes?

And finally, will Azure Data Factory ever be able to parse JSON with nested arrays?

1 – I couldn't recommend U-SQL in 2021, plus its JSON manipulation was never that great. Please consider using ADF to call something external that can manipulate JSON, like a Databricks notebook, a Synapse notebook, or Mapping Data Flows for a low-code experience, as nicely outlined by Joseph. One other alternative is Azure Synapse Analytics, which has a serverless database with JSON function support and can work with JSON as described here. It might also be useful for you to post some sample JSON with expected results. – wBob
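For reference, here is a rough sketch of the notebook route mentioned above (Databricks or Synapse Spark). The lake path is a placeholder and the field names assume the nested structure of the sample JSON shown in the answer below, so treat it as an illustration rather than a drop-in script.

    # Rough sketch only: placeholder path, schema assumed from the sample JSON
    # in the answer below. `spark` is the session a Databricks / Synapse
    # notebook provides by default.
    from pyspark.sql.functions import col, explode

    raw = (
        spark.read
        .option("multiLine", True)  # each file holds one JSON document
        .json("abfss://<container>@<account>.dfs.core.windows.net/raw-logs/")
    )

    # Unroll value -> lists -> employees into one row per employee.
    flat = (
        raw
        .select(explode("value").alias("v"))
        .select(col("v.obj").alias("obj"), explode("v.lists").alias("l"))
        .select("obj", explode("l.employees").alias("e"))
        .select("obj", "e.name", "e.id", "e.tt_1", "e.tt_2", "e.tt3_")
    )

    flat.show()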

1 Answer

2 votes

We can parse JSON files and extract information with a mapping data flow, and we can handle JSON with nested arrays via the Flatten transformation in that data flow (a sketch of the expected flattened output follows the JSON example below).

JSON example:

    {
        "count": 1,
        "value": [{
            "obj": 123,
            "lists": [{
                "employees": [{
                    "name": "",
                    "id": "001",
                    "tt_1": 0,
                    "tt_2": 4,
                    "tt3_": 1
                },
                {
                    "name": "",
                    "id": "002",
                    "tt_1": 10,
                    "tt_2": 8,
                    "tt3_": 1
                }]
            }]
        }]
    }
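To make the expected result of the flatten step concrete, here is a small local sketch with pandas (not ADF itself, and assuming pandas is available) that unrolls the same sample document the way the Flatten transformation should:

    # Local illustration of the flattened rows, not the data flow itself.
    import pandas as pd

    doc = {
        "count": 1,
        "value": [{
            "obj": 123,
            "lists": [{
                "employees": [
                    {"name": "", "id": "001", "tt_1": 0,  "tt_2": 4, "tt3_": 1},
                    {"name": "", "id": "002", "tt_1": 10, "tt_2": 8, "tt3_": 1},
                ]
            }]
        }]
    }

    # Unroll value -> lists -> employees into one row per employee,
    # carrying the parent "obj" field along.
    flat = pd.json_normalize(doc["value"],
                             record_path=["lists", "employees"],
                             meta=["obj"])
    print(flat)
    #   name   id  tt_1  tt_2  tt3_  obj
    # 0        001     0     4     1  123
    # 1        002    10     8     1  123

Depending on how deeply the arrays are nested, the data flow may need one Flatten transformation per nesting level to reach the same shape.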

Flatten transformation settings and output data preview: [screenshots in the original post]

Mapping data flows follow an extract, load, and transform (ELT) approach and work with staging datasets that are all in Azure. Currently, only certain Azure datasets can be used in a source transformation; the supported list is in the ADF data flow documentation.

So I think using a mapping data flow in ADF is the easiest way to extract the information, and the flattened output can then act as a source for downstream reporting processes.
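Continuing the pandas sketch above (file name illustrative; writing Parquet requires pyarrow or fastparquet), the flattened rows could then be landed as a columnar file for reporting tools to read:

    # Land the flattened rows for downstream reporting.
    flat.to_parquet("employee_stats.parquet", index=False)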