
I have files in an Azure blob storage like this:

entity
|- part001.parquet
|- part002.parquet
|- part003.parquet
|- part004.parquet
...

and this parquet data has a date column; let's call it transaction_date.

I want to make an Azure Data Factory pipeline to migrate all this data into another blob storage, like this:

entity
|- YYYY
    |-MM
        |- entity_YYYYMMDD.parquet
           ...
|- YYYY
    |-MM
        |- entity_YYYYMMDD.parquet
           ...

So each file will contain only the transactions for a specific date, based on transaction_date, making them easier to select later.

Is there any way to do this using ADF or another tool in the Azure stack?


1 Answer


What you are after is dynamic partitioning, i.e. partitioning based on a field/column value.

We use Azure Databricks to handle such things, and if it needs to be recurring you can schedule the notebook via Azure Data Factory v2. In the notebook you can have a PySpark script like the following (please note this code is just a pattern; it is not tested):

from pyspark.sql.functions import year, month

# Read all the source parquet files from the input container
extractData = spark.read.parquet("<<<input blob storage path>>>")

# Derive year and month columns from transaction_date to partition on
extractData = extractData \
    .withColumn("transaction_year", year("transaction_date")) \
    .withColumn("transaction_month", month("transaction_date"))

# Write the data back out partitioned by year and month;
# each distinct value becomes its own folder in the output path
extractData.write.mode("overwrite") \
    .partitionBy("transaction_year", "transaction_month") \
    .parquet("<<<output blob storage path>>>")
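If you also need the day-level split from the question (one folder per transaction_date), a minimal sketch under the same assumptions (the extractData DataFrame above, transaction_date being a date/timestamp column) is to add a day key and partition on it as well; like the pattern above, this is untested:

from pyspark.sql.functions import date_format

# Add a YYYYMMDD partition key derived from transaction_date
extractData = extractData \
    .withColumn("transaction_day", date_format("transaction_date", "yyyyMMdd"))

# Each day ends up in its own folder, e.g.
# transaction_year=2019/transaction_month=01/transaction_day=20190115/part-....parquet
extractData.write.mode("overwrite") \
    .partitionBy("transaction_year", "transaction_month", "transaction_day") \
    .parquet("<<<output blob storage path>>>")

Note that Spark names the files inside each partition folder part-*.parquet rather than entity_YYYYMMDD.parquet; if the exact file names matter, you would have to rename them afterwards (for example with dbutils.fs.mv in the same notebook).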

Can we do this using just Azure Data Factory? Assuming you are using Azure Data Factory v2, it is hard (though not impossible) to partition based on a field value, compared to the approach above.

Having said that, there is a public preview of Azure Data Factory Mapping Data Flow, which uses Azure Databricks for compute under the covers. I haven't tested or played with it, but perhaps you could use a transformation such as Conditional Split. Then again, using Databricks directly is just simpler.