2 votes

I'm new to Azure Data Factory and I'm working on a proof of concept for my organisation. I'm finding it hard to get good information on fairly basic things, and I'm hoping someone can point me to some good resources for my use case.

I know this question is quite general, but any help would be useful. I'm going around in circles at the moment and I feel like I'm wasting a lot of time. Something that would take me a few minutes in SSIS has taken hours of research so far, and I still haven't progressed much.

Here's the use case:

  • A gzip archive arrives in blob storage every hour. It contains several .tsv files, but I only want to extract one of them, which holds web click-stream data.
  • I want to extract this one .tsv file from the archive, append the datetime to its name, and then save it to Azure Data Lake Storage.
  • I want this to happen each time a new gzip archive arrives.

So far I have:

  • Azure Data Factory V2 set up
  • A linked service set up to the blob container
  • A linked service set up to Data Lake Store Gen1
  • I think all the permissions and firewall issues are sorted for ADF to access storage.

Is Azure Data Factory the right tool for this job? And if so, where do I go from here? How do I build the datasets and a pipeline to achieve the use case, and how do I schedule it to run when a new gzip archive arrives?


2 Answers

1 vote

Azure Data Factory is built for complex hybrid extract-transform-load (ETL), extract-load-transform (ELT), and data integration projects, so it is the right tool for this job. Based on what you have so far, you need to set up the following in your data factory:

  1. Create a pipeline to run the whole workflow. It contains a Copy activity whose source dataset is the blob and whose sink dataset is Data Lake Store Gen1. Note that the source blob dataset references your blob linked service and the sink dataset references your Data Lake Store Gen1 linked service.
  2. In the blob source dataset, set the compression type property to GZip; this allows ADF to read GZip-compressed data from the blob.
  3. Use an event trigger to fire the pipeline run each time a new gzip archive arrives (see the sketches after this list).
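To make the three steps more concrete, here is a rough sketch of what the JSON definitions could look like. The dataset names, linked service names, folder paths and file names below are placeholders I've made up; swap in your own. The source dataset reads the GZip-compressed blob, and the sink dataset writes a .tsv to Data Lake Store Gen1 with the datetime appended to the file name via an expression:

```json
{
    "name": "BlobGzipSourceDataset",
    "properties": {
        "type": "AzureBlob",
        "linkedServiceName": { "referenceName": "BlobStorageLinkedService", "type": "LinkedServiceReference" },
        "typeProperties": {
            "folderPath": "<your-container>/",
            "fileName": "clickstream.tsv.gz",
            "format": { "type": "TextFormat", "columnDelimiter": "\t" },
            "compression": { "type": "GZip" }
        }
    }
}
```

```json
{
    "name": "AdlsTsvSinkDataset",
    "properties": {
        "type": "AzureDataLakeStoreFile",
        "linkedServiceName": { "referenceName": "DataLakeStoreLinkedService", "type": "LinkedServiceReference" },
        "typeProperties": {
            "folderPath": "clickstream/",
            "fileName": {
                "value": "@concat('clickstream_', formatDateTime(utcnow(), 'yyyyMMddHHmm'), '.tsv')",
                "type": "Expression"
            },
            "format": { "type": "TextFormat", "columnDelimiter": "\t" }
        }
    }
}
```

The pipeline then only needs a Copy activity with BlobGzipSourceDataset as its input and AdlsTsvSinkDataset as its output. For step 3, a BlobEventsTrigger along these lines fires that pipeline whenever a new .gz blob lands in the container (again, the scope and names are placeholders):

```json
{
    "name": "NewGzipArrivedTrigger",
    "properties": {
        "type": "BlobEventsTrigger",
        "typeProperties": {
            "blobPathBeginsWith": "/<your-container>/blobs/",
            "blobPathEndsWith": ".gz",
            "events": [ "Microsoft.Storage.BlobCreated" ],
            "scope": "/subscriptions/<subscription-id>/resourceGroups/<resource-group>/providers/Microsoft.Storage/storageAccounts/<storage-account>"
        },
        "pipelines": [
            { "pipelineReference": { "referenceName": "CopyClickstreamPipeline", "type": "PipelineReference" } }
        ]
    }
}
```

One thing to double-check: the GZip compression setting assumes the blob is a single gzipped file, so if the hourly drop is really a multi-file .tar.gz, the decompression step may need different handling.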
0 votes

In terms of getting help, guidance and documentation on Azure Data Factory V2, one of the best places is the designer itself. There is a help icon in the top right-hand corner, offering links to the Guided Tour and the Documentation:

Azure Data Factory designer

The Guided Tour is context sensitive, so it's worth clicking it in different places to get help, e.g. in the Copy activity, from within a dataset, etc.

The documentation has a helpful mix of videos, tutorials and 5-minute quickstarts, and of course it's always kept up to date.

Finally, Stack Overflow and MSDN are great resources for getting help on ADF. I'm pretty sure members of the product team come on and answer questions, so you can't get better help than that. This tends to work best when you have a specific question or error message and something to show.