2 votes

I am trying to use Azure Data Factory to read data from a FHIR server and transform the results into newline delimited JSON (ndjson) files in Azure Blob storage. Specifically, if you query a FHIR server, you might get something like:

{
    "resourceType": "Bundle",
    "id": "som-id",
    "type": "searchset",
    "link": [
        {
            "relation": "next",
            "url": "https://fhirserver/?ct=token"
        },
        {
            "relation": "self",
            "url": "https://fhirserver/"
        }
    ],
    "entry": [
        {
            "fullUrl": "https://fhirserver/Organization/1234",
            "resource": {
                "resourceType": "Organization",
                "id": "1234",
                // More fields
            }
        },
        {
            "fullUrl": "https://fhirserver/Organization/456",
            "resource": {
                "resourceType": "Organization",
                "id": "456",
                // More fields
            }
        },

        // More resources
    ]
}

Basically, a bundle of resources. I would like to transform that into a newline-delimited (aka ndjson) file where each line is just the JSON for a single resource:

{"resourceType": "Organization", "id": "1234", // More fields }
{"resourceType": "Organization", "id": "456", // More fields }
// More lines with resources
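
For reference, here is a minimal Python sketch of the transformation I am after, just to illustrate the desired output (the file names are made up):

import json

# Illustrative only: read one bundle and write one resource per line (ndjson).
with open("bundle.json") as f:
    bundle = json.load(f)

with open("myout.ndjson", "w") as out:
    for entry in bundle.get("entry", []):
        out.write(json.dumps(entry["resource"]) + "\n")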

I am able to get the REST connector set up and it can query the FHIR server (including pagination), but no matter what I try I cannot seem to generate the output I want. I set up an Azure Blob storage dataset:

{
    "name": "AzureBlob1",
    "properties": {
        "linkedServiceName": {
            "referenceName": "AzureBlobStorage1",
            "type": "LinkedServiceReference"
        },
        "type": "AzureBlob",
        "typeProperties": {
            "format": {
                "type": "JsonFormat",
                "filePattern": "setOfObjects"
            },
            "fileName": "myout.json",
            "folderPath": "outfhirfromadf"
        }
    },
    "type": "Microsoft.DataFactory/factories/datasets"
}

And configured a copy activity:

{
    "name": "pipeline1",
    "properties": {
        "activities": [
            {
                "name": "Copy Data1",
                "type": "Copy",
                "policy": {
                    "timeout": "7.00:00:00",
                    "retry": 0,
                    "retryIntervalInSeconds": 30,
                    "secureOutput": false,
                    "secureInput": false
                },
                "typeProperties": {
                    "source": {
                        "type": "RestSource",
                        "httpRequestTimeout": "00:01:40",
                        "requestInterval": "00.00:00:00.010"
                    },
                    "sink": {
                        "type": "BlobSink"
                    },
                    "enableStaging": false,
                    "translator": {
                        "type": "TabularTranslator",
                        "schemaMapping": {
                            "resource": "resource"
                        },
                        "collectionReference": "$.entry"
                    }
                },
                "inputs": [
                    {
                        "referenceName": "FHIRSource",
                        "type": "DatasetReference"
                    }
                ],
                "outputs": [
                    {
                        "referenceName": "AzureBlob1",
                        "type": "DatasetReference"
                    }
                ]
            }
        ]
    },
    "type": "Microsoft.DataFactory/factories/pipelines"
}

But in the end (in spite of configuring the schema mapping), the result in the blob is always just the original bundle returned from the server. If I configure the output blob as comma-delimited text, I can extract fields and create a flattened tabular view, but that is not really what I want.

Any suggestions would be much appreciated.

My experience so far with Azure Data Factory's Copy Activity is that it only copies data from one place to another, and every time I needed some sort of transformation, I got hurt :) Would it be acceptable for you to consider Databricks and use some Python/Scala scripting to do the transformation you need? - Kzrystof
@Kzrystof, thanks for your comment. That would be one possible option. I guess I was trying to see how far I could get with ADF on this path. So yes, definitely an option, but I would also like to know if this is (or will be) possible with ADF. - MichaelHansen
Oh, I understand what you mean :) I actually asked a similar question a few weeks ago about how a Copy Activity could exclude rows that wouldn't fit certain criteria... - Kzrystof

3 Answers

0 votes

As briefly discussed in the comment, the Copy Activity does not provide much functionality aside from mapping data. As stated in the documentation, the Copy activity does the following operations:

  1. Reads data from a source data store.
  2. Performs serialization/deserialization, compression/decompression, column mapping, etc. It does these operations based on the configurations of the input dataset, output dataset, and Copy Activity.
  3. Writes data to the sink/destination data store.

It does not look like the Copy Activity does anything else aside from efficiently copying stuff around.

What I found to work was to use Databricks.

Here are the steps:

  1. Add a Databricks account to your subscription;
  2. Go to the Databricks page by clicking the authoring button;
  3. Create a notebook;
  4. Write the script (Scala or Python; .NET support was also recently announced).

The script would do the following (a minimal sketch follows the list):

  1. Read the data from the Blob storage;
  2. Filter out & transform the data as needed;
  3. Write the data back to a Blob storage;
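
Here is a rough idea of what such a notebook cell could look like in Python (PySpark), assuming the raw bundles have already been copied to a mounted blob container; the paths are made up and spark is the session the Databricks notebook provides:

from pyspark.sql.functions import explode, to_json

# Assumed, illustrative paths: /mnt/fhir/bundles holds the raw bundle JSON files,
# /mnt/fhir/ndjson is where the flattened resources go.
bundles = spark.read.option("multiLine", True).json("/mnt/fhir/bundles/*.json")

# Explode the entry array so each resource becomes its own row,
# then serialize each resource struct back to a JSON string.
resources = (bundles
    .select(explode("entry").alias("entry"))
    .select(to_json("entry.resource").alias("value")))

# Writing a single string column as text yields one JSON object per line (ndjson).
resources.write.mode("overwrite").text("/mnt/fhir/ndjson")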

You can test your script from there and, once ready, you can go back to your pipeline and create a Notebook activity that will point to your notebook containing the script.

I struggled coding in Scala but it was worth it :)

0 votes

So I sort of found a solution. If I keep the original copy step, where the bundles are simply dumped into the JSON file, and then do another conversion from that JSON file to what I pretend is a text file in another blob, I can get the ndjson file created.

Basically, define another blob dataset:

{
    "name": "AzureBlob2",
    "properties": {
        "linkedServiceName": {
            "referenceName": "AzureBlobStorage1",
            "type": "LinkedServiceReference"
        },
        "type": "AzureBlob",
        "structure": [
            {
                "name": "Prop_0",
                "type": "String"
            }
        ],
        "typeProperties": {
            "format": {
                "type": "TextFormat",
                "columnDelimiter": ",",
                "rowDelimiter": "",
                "quoteChar": "",
                "nullValue": "\\N",
                "encodingName": null,
                "treatEmptyAsNull": true,
                "skipLineCount": 0,
                "firstRowAsHeader": false
            },
            "fileName": "myout.json",
            "folderPath": "adfjsonout2"
        }
    },
    "type": "Microsoft.DataFactory/factories/datasets"
}

Note that this one uses TextFormat, and also note that the quoteChar is blank (otherwise the JSON on each row would get wrapped in quotes). If I then add another Copy Activity:

{
    "name": "pipeline1",
    "properties": {
        "activities": [
            {
                "name": "Copy Data1",
                "type": "Copy",
                "policy": {
                    "timeout": "7.00:00:00",
                    "retry": 0,
                    "retryIntervalInSeconds": 30,
                    "secureOutput": false,
                    "secureInput": false
                },
                "typeProperties": {
                    "source": {
                        "type": "RestSource",
                        "httpRequestTimeout": "00:01:40",
                        "requestInterval": "00.00:00:00.010"
                    },
                    "sink": {
                        "type": "BlobSink"
                    },
                    "enableStaging": false,
                    "translator": {
                        "type": "TabularTranslator",
                        "schemaMapping": {
                            "['resource']": "resource"
                        },
                        "collectionReference": "$.entry"
                    }
                },
                "inputs": [
                    {
                        "referenceName": "FHIRSource",
                        "type": "DatasetReference"
                    }
                ],
                "outputs": [
                    {
                        "referenceName": "AzureBlob1",
                        "type": "DatasetReference"
                    }
                ]
            },
            {
                "name": "Copy Data2",
                "type": "Copy",
                "dependsOn": [
                    {
                        "activity": "Copy Data1",
                        "dependencyConditions": [
                            "Succeeded"
                        ]
                    }
                ],
                "policy": {
                    "timeout": "7.00:00:00",
                    "retry": 0,
                    "retryIntervalInSeconds": 30,
                    "secureOutput": false,
                    "secureInput": false
                },
                "typeProperties": {
                    "source": {
                        "type": "BlobSource",
                        "recursive": true
                    },
                    "sink": {
                        "type": "BlobSink"
                    },
                    "enableStaging": false,
                    "translator": {
                        "type": "TabularTranslator",
                        "columnMappings": {
                            "resource": "Prop_0"
                        }
                    }
                },
                "inputs": [
                    {
                        "referenceName": "AzureBlob1",
                        "type": "DatasetReference"
                    }
                ],
                "outputs": [
                    {
                        "referenceName": "AzureBlob2",
                        "type": "DatasetReference"
                    }
                ]
            }
        ]
    },
    "type": "Microsoft.DataFactory/factories/pipelines"
}

Then it all works out. It is not ideal in that I now have two copies of the data in blobs, but one can easily be deleted, I suppose.
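
The intermediate blob could be cleaned up afterwards, for example with ADF's Delete activity or a small script; here is a sketch with the azure-storage-blob Python package, where the connection string, container, and blob path are placeholders:

from azure.storage.blob import BlobServiceClient

# Placeholder names: substitute your storage connection string, container, and path.
service = BlobServiceClient.from_connection_string("<storage-connection-string>")
blob = service.get_blob_client(container="<container>", blob="<folder>/myout.json")

# Remove the intermediate JSON dump produced by the first copy activity.
blob.delete_blob()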

I would still love to hear about it if somebody has a one-step solution.

0 votes

For anyone finding this post in the future: you can just use the $export API call to accomplish this. Note that you have to have a storage account linked to your FHIR server.

https://build.fhir.org/ig/HL7/bulk-data/export.html#endpoint---system-level-export
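
A minimal sketch of kicking off a system-level export with Python's requests, following the bulk data spec above (the server URL and token are placeholders; on Azure API for FHIR the resulting ndjson files land in the linked storage account):

import requests

# Placeholder values: replace with your FHIR server URL and a valid access token.
fhir_base = "https://fhirserver"
headers = {
    "Accept": "application/fhir+json",
    "Prefer": "respond-async",
    "Authorization": "Bearer <token>",
}

# Kick off a system-level export; the server answers 202 Accepted and returns
# a status URL in the Content-Location header.
resp = requests.get(f"{fhir_base}/$export", headers=headers)
resp.raise_for_status()
status_url = resp.headers["Content-Location"]

# Poll the status URL until the export completes (202 while in progress,
# 200 with a JSON manifest listing the ndjson output files when done).
status = requests.get(status_url, headers={"Authorization": "Bearer <token>"})
print(status.status_code, status.text)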