
I'm a little bit lost regarding Azure Data Factory.

My goal is to retrieve data from our company's PostgreSQL database and transform it into an Azure SQL database (which Power BI can later access).

I created a pipeline and tried to use "Data Flow" as an activity. However, I cannot select my PostgreSQL database as a source there. Valid sources are, for example, other Azure SQL databases.

So I tried to use the "Copy Data" activity to copy data from the PostgreSQL database into an Azure SQL database (1), and afterwards transform the data with a "Data Flow" into an Azure SQL database (2) (which has a different table structure).

Is that a good and valid approach?

There are a couple of problems I'm having with this approach:

  • I cannot select multiple tables from my source PostgreSQL dataset (nor from my target Azure SQL database (1) dataset). Azure only gives me the option to select a single table or "None". I could cope with this by creating several datasets, each with a different table, but that seems wrong and not feasible.
  • I am not sure what happens if the data is already present in the Azure SQL database (1). In the "Data Flow" activity I have the option to select "Allow insert", "Allow update" or different key columns. In the "Copy Data" activity I do not have these options.

Could someone point me in the right direction, please?


1 Answer


There are three options for where to do the transformation processing:

  1. Processing on the data source side: Put the logic into a stored procedure or function (PostgreSQL supports these) or directly into the SQL query, then fetch only the result table (see the sketch after this list).

    • Less network traffic
    • More load on the source, which may not be allowed
  2. Processing on the Data Factory integration runtime: This is the option that uses Data Flow. The tables are copied to the integration runtime, processed there, and the result is written to your sink. Since this is quite a new option, not many connectors are available yet; you might need to work around that by copying the data to Azure SQL first.

    • Probably slower, depending on the sizes of your resources
    • Easy to build the logic in the visual interface
  3. Processing on the sink: Copy the raw data to Azure SQL and run a query or stored procedure there (see the sketch after this list).

    • Most flexibility, since you can scale Azure SQL up if the processing needs more resources
    • Fast processing
    • More flexible if delta ingestion and processing is possible (doing upserts/merges)
    • More maintenance

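For option 3, here is a minimal T-SQL sketch of a sink-side procedure, assuming a hypothetical staging table stg.orders (filled by Copy Data) and a hypothetical target table dbo.orders_clean; a Stored Procedure activity in the pipeline would call it after the copy step:

```sql
-- Option 3 sketch (hypothetical tables): transform/merge on the sink.
-- Copy Data loads the raw rows into stg.orders first; the pipeline then
-- calls this procedure with a Stored Procedure activity.
CREATE OR ALTER PROCEDURE dbo.usp_merge_orders
AS
BEGIN
    SET NOCOUNT ON;

    MERGE dbo.orders_clean AS tgt
    USING (
        SELECT order_id, customer_id, amount, order_date
        FROM stg.orders
    ) AS src
    ON tgt.order_id = src.order_id
    WHEN MATCHED THEN
        UPDATE SET tgt.customer_id = src.customer_id,
                   tgt.amount      = src.amount,
                   tgt.order_date  = src.order_date
    WHEN NOT MATCHED THEN
        INSERT (order_id, customer_id, amount, order_date)
        VALUES (src.order_id, src.customer_id, src.amount, src.order_date);

    TRUNCATE TABLE stg.orders;  -- clear the staging table for the next run
END;
```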
All three are valid options; which one to choose depends entirely on your use case and requirements (operations, SLAs, ...).

About the problem with selecting multiple tables: you will need to handle every table separately, but if they all follow the same logic you can create a watermark/control table, look up the table names from it with a Lookup activity, and loop over a parameterized Copy Data activity with a ForEach. That way a single pipeline copies all the tables sequentially (see the sketch below).
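A minimal sketch of such a control/watermark table, assuming a hypothetical dbo.copy_control table on the Azure SQL side; a Lookup activity reads it, and a ForEach activity iterates over the rows, passing each table name into the parameterized Copy Data activity:

```sql
-- Sketch of a hypothetical control/watermark table on Azure SQL.
CREATE TABLE dbo.copy_control (
    source_schema  NVARCHAR(128) NOT NULL,
    source_table   NVARCHAR(128) NOT NULL,
    sink_table     NVARCHAR(128) NOT NULL,
    watermark      DATETIME2     NULL      -- last successfully copied timestamp
);

INSERT INTO dbo.copy_control (source_schema, source_table, sink_table, watermark)
VALUES (N'public', N'orders',    N'stg.orders',    NULL),
       (N'public', N'customers', N'stg.customers', NULL);

-- Query used by the Lookup activity; the ForEach then iterates the rows
-- and feeds @{item().source_table} etc. into the parameterized Copy Data activity.
SELECT source_schema, source_table, sink_table, watermark
FROM dbo.copy_control;
```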

About data already being present: Copy Data only inserts (appends) the data. If you want to truncate the table first, you can add that as a "pre-copy script" on the sink (see the sketch below).
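A minimal sketch of such a pre-copy script, again assuming the hypothetical staging table stg.orders; it is entered in the sink settings of the Copy Data activity and runs before every copy:

```sql
-- Hypothetical pre-copy script for the Copy Data sink: empty the staging
-- table so each run starts from a clean slate instead of appending.
TRUNCATE TABLE stg.orders;
-- Or, if you only want to replace a slice of the data:
-- DELETE FROM stg.orders WHERE order_date >= '2020-01-01';
```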

If I forgot anything, please comment and I will update the post accordingly.