2 votes

I'm new to Azure Data Factory and I'm working on a proof of concept for my organisation. I'm finding it hard to get good information on fairly basic things, and I'm hoping someone can point me to some good resources for my use case.

I know this question is quite general, but any help would be useful. I'm going around in circles at the moment and I feel like I'm wasting a lot of time. Something that would take me a few minutes in SSIS has taken hours of research so far, and I still haven't progressed much.

Here's the use case:

  • A gzip archive arrives in blob storage every hour. It contains several .tsv files, but I only want to extract one of them, which holds web click-stream data.
  • I want to extract this one .tsv file from the archive, append the datetime to its name, and then save it to Azure Data Lake Storage.
  • I want this to happen each time a new gzip archive arrives.

So far I have:

  • Azure Data Factory V2 set up
  • A linked service set up to the blob container
  • A linked service set up to Data Lake Store Gen1
  • I think all the permissions and firewall issues are sorted for ADF to access storage.

Is Azure Data Factory the right tool for this job? And if so, where do I go from here? How do I build the datasets and a pipeline to achieve the use case, and how do I schedule it to run when a new gzip archive arrives?


2 Answers

1 vote

Azure Data Factory is built for complex hybrid extract-transform-load (ETL), extract-load-transform (ELT), and data integration projects, so it is the right tool for this job. Based on what you have so far, you need to set up the following in your data factory:

  1. Create a pipeline to run the whole workflow. It contains a Copy activity whose source dataset is the blob and whose sink dataset is Data Lake Store Gen1. Note that the source blob dataset references your blob linked service and the sink dataset references your Data Lake Store Gen1 linked service.
  2. In the blob source dataset, set the compression type property to GZip; this allows ADF to read GZip-compressed data from the blob.
  3. Use an event trigger to fire the pipeline run each time a new gzip archive arrives (see the sketches after this list).
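To make the three steps more concrete, here is a rough sketch of what the JSON definitions could look like. The dataset names, linked service names, folder paths and file names below are placeholders I've made up; swap in your own. The source dataset reads the GZip-compressed blob, and the sink dataset writes a .tsv to Data Lake Store Gen1 with the datetime appended to the file name via an expression:

```json
{
    "name": "BlobGzipSourceDataset",
    "properties": {
        "type": "AzureBlob",
        "linkedServiceName": { "referenceName": "BlobStorageLinkedService", "type": "LinkedServiceReference" },
        "typeProperties": {
            "folderPath": "<your-container>/",
            "fileName": "clickstream.tsv.gz",
            "format": { "type": "TextFormat", "columnDelimiter": "\t" },
            "compression": { "type": "GZip" }
        }
    }
}
```

```json
{
    "name": "AdlsTsvSinkDataset",
    "properties": {
        "type": "AzureDataLakeStoreFile",
        "linkedServiceName": { "referenceName": "DataLakeStoreLinkedService", "type": "LinkedServiceReference" },
        "typeProperties": {
            "folderPath": "clickstream/",
            "fileName": {
                "value": "@concat('clickstream_', formatDateTime(utcnow(), 'yyyyMMddHHmm'), '.tsv')",
                "type": "Expression"
            },
            "format": { "type": "TextFormat", "columnDelimiter": "\t" }
        }
    }
}
```

The pipeline then only needs a Copy activity with BlobGzipSourceDataset as its input and AdlsTsvSinkDataset as its output. For step 3, a BlobEventsTrigger along these lines fires that pipeline whenever a new .gz blob lands in the container (again, the scope and names are placeholders):

```json
{
    "name": "NewGzipArrivedTrigger",
    "properties": {
        "type": "BlobEventsTrigger",
        "typeProperties": {
            "blobPathBeginsWith": "/<your-container>/blobs/",
            "blobPathEndsWith": ".gz",
            "events": [ "Microsoft.Storage.BlobCreated" ],
            "scope": "/subscriptions/<subscription-id>/resourceGroups/<resource-group>/providers/Microsoft.Storage/storageAccounts/<storage-account>"
        },
        "pipelines": [
            { "pipelineReference": { "referenceName": "CopyClickstreamPipeline", "type": "PipelineReference" } }
        ]
    }
}
```

One thing to double-check: the GZip compression setting assumes the blob is a single gzipped file, so if the hourly drop is really a multi-file .tar.gz, the decompression step may need different handling.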
0 votes

In terms of getting help, guidance and documentation on Azure Data Factory V2, one of the best places is the designer itself. There is a help icon in the top right-hand corner, offering links to the Guided Tour and the Documentation:

Azure Data Factory designer

The Guided Tour is context sensitive, so it's worth clicking it in different places to get help, e.g. in the Copy activity, from within a dataset, etc.

The documentation has a helpful mix of videos, tutorials and 5-minute quickstarts, and of course it's always kept up to date.

Finally, Stack Overflow and MSDN are great resources for getting help on ADF. I'm pretty sure members of the product team come on and answer questions, so you can't get better help than that. This tends to work best when you have a specific question or error message and something to show.