I am trying to create a Data Factory that, once a week, copies and processes large blob files (the source) to a SQL database (the sink) in Python: it reads the input data set line by line, extracts an ID, uses that ID to do a lookup on Cosmos DB to get an additional piece of data, recomposes the output dataset, and writes it to the sink. I have a Python script that does this as a once-off (i.e. it reads the entire blob every time) without ADF, but I now want to use the scheduling features of ADF to automate it.
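
For context, the once-off script is roughly this shape (a minimal sketch only: the connection strings, container/blob names, Cosmos DB database/container, SQL table and the "value" field are placeholders; it assumes the azure-storage-blob, azure-cosmos and pyodbc packages):

    import pyodbc
    from azure.storage.blob import BlobServiceClient
    from azure.cosmos import CosmosClient

    # Placeholder settings -- replace with your own values.
    BLOB_CONN_STR = "<storage-connection-string>"
    COSMOS_URL = "https://<account>.documents.azure.com:443/"
    COSMOS_KEY = "<cosmos-key>"
    SQL_CONN_STR = "<odbc-connection-string>"

    def run():
        # Read the source blob and iterate over it line by line.
        blob = BlobServiceClient.from_connection_string(BLOB_CONN_STR) \
            .get_blob_client(container="input", blob="weekly.csv")
        lines = blob.download_blob().readall().decode("utf-8").splitlines()

        # Cosmos DB container used for the per-ID lookup.
        cosmos = CosmosClient(COSMOS_URL, credential=COSMOS_KEY)
        lookup = cosmos.get_database_client("mydb").get_container_client("lookup")

        sql = pyodbc.connect(SQL_CONN_STR)
        cursor = sql.cursor()
        for line in lines:
            record_id = line.split(",")[0]              # extract the ID
            extra = lookup.read_item(item=record_id,    # Cosmos DB lookup
                                     partition_key=record_id)
            cursor.execute("INSERT INTO dbo.Output (Id, Line, Extra) VALUES (?, ?, ?)",
                           record_id, line, extra["value"])
        sql.commit()

    if __name__ == "__main__":
        run()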

Is there a way of creating a custom copy activity in Python that I can inject my current code logic into? Azure currently only documents .NET custom activities (https://docs.microsoft.com/en-us/azure/data-factory/transform-data-using-dotnet-custom-activity), which don't fit into my stack.

The Python Azure SDK doesn't currently have any documentation on creating a custom activity.

1 Answer

If you look at the example in the documentation you linked, you can see that you can run an executable on the node:

     "typeProperties": {
          "command": "helloworld.exe",
          "folderPath": "customactv2/helloworld",
          "resourceLinkedService": {
            "referenceName": "StorageLinkedService",
            "type": "LinkedServiceReference"
          }
        }

Further down, in the differences between v1 and v2, they show that the command can simply be a shell command:

    cmd /c echo hello world

So if you can create an executable to kick off your Python code, it might just work. You can also pass parameters. However, the code will be run on Azure Batch, which provisions a VM for you, and that VM might not have all the dependencies that you need, so you'll have to create a "portable" package for this to work. Maybe this post can help you with that.
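
As a rough illustration (not tested; the folder, script and property names are placeholders, and it assumes Python is available on the Batch node or is shipped alongside the script), the same typeProperties could point at your script instead of an .exe, with extendedProperties used to pass parameters:

    "typeProperties": {
        "command": "cmd /c python process_blob.py",
        "folderPath": "customactv2/processblob",
        "resourceLinkedService": {
            "referenceName": "StorageLinkedService",
            "type": "LinkedServiceReference"
        },
        "extendedProperties": {
            "inputBlob": "weekly.csv",
            "sqlTable": "dbo.Output"
        }
    }

If I remember correctly, the extended properties end up in an activity.json file in the working directory on the node, so your script should be able to read its parameters from there.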

A bit more classy would be to trigger an Azure Function with a web activity. But it seems to be quite beta stuff: https://ourwayoflyf.com/running-python-code-on-azure-functions-app/
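
If you go that route, the function itself would just be an HTTP-triggered Python function that the ADF web activity POSTs to. A minimal sketch, assuming the Python worker for Azure Functions (the azure-functions package) and a hypothetical "blobName" field in the request body:

    import azure.functions as func

    def main(req: func.HttpRequest) -> func.HttpResponse:
        # The ADF web activity can POST a JSON body telling the function what to process.
        body = req.get_json()
        blob_name = body.get("blobName", "weekly.csv")

        # ... run the same read / Cosmos DB lookup / SQL write logic as the once-off script ...

        return func.HttpResponse(f"Processed {blob_name}", status_code=200)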