0 votes

I'm trying to do ETL with open-source frameworks. I have heard about two of them, Apache Beam and Apache Airflow. Which one is better for an entire ETL or ELT workflow, compared to tools like Talend, Azure Data Factory, etc.? In fact, I'm trying to do everything with cloud data warehouses (Redshift, Azure Data Warehouse, Snowflake, etc.). Which one is a good fit for this kind of work? It would also be great to get a comparison between those two frameworks. Thanks in advance.

Can you share more examples of your data pipelines? Do you simply need to move data from one store to another? Do you want jobs that run constantly, or jobs triggered periodically instead? Do you need streaming data support? What kind of data transformation do you do in the process? How complex might the transformation logic be? What kind of programming/execution environment do you want to have? What kind of scale are you expecting? – Anton

2 Answers

3 votes

Apache Airflow is not an ETL framework; it is an application for scheduling and monitoring workflows, which will schedule and monitor your ETL pipelines. Apache Beam is a unified model for defining data processing workflows.
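For illustration, here is a minimal sketch of a Beam pipeline in Python: it reads a text file, applies a simple per-line transform, and writes the result. The file paths and the parsing logic are hypothetical placeholders.

```python
# Minimal Apache Beam pipeline sketch: read a CSV, transform each line,
# write the output. Paths and parse logic are hypothetical placeholders.
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

def parse_row(line):
    # Hypothetical transformation: keep the first two CSV columns.
    fields = line.split(",")
    return f"{fields[0]},{fields[1]}"

if __name__ == "__main__":
    options = PipelineOptions()  # runner, project, etc. come from CLI flags
    with beam.Pipeline(options=options) as p:
        (
            p
            | "Read" >> beam.io.ReadFromText("input.csv", skip_header_lines=1)
            | "Transform" >> beam.Map(parse_row)
            | "Write" >> beam.io.WriteToText("output")
        )
```

The same pipeline code can run locally (DirectRunner) or on a distributed backend such as Dataflow, Flink, or Spark, depending on the runner you pass in the pipeline options.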

That means your ETL pipelines will be written using Apache Beam, and Airflow will trigger and schedule these pipelines.
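A sketch of that division of labor, assuming the Beam pipeline above lives at a hypothetical path: Airflow defines a DAG that invokes the pipeline script on a daily schedule. (The Apache Beam provider package for Airflow also ships dedicated Beam operators, but a plain BashOperator keeps the idea visible.)

```python
# Airflow DAG sketch that schedules the Beam pipeline once a day.
# The DAG id and script path are assumptions, not fixed names.
from datetime import datetime
from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="etl_beam_pipeline",          # hypothetical DAG id
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    run_beam = BashOperator(
        task_id="run_beam_pipeline",
        # Hypothetical path; --runner selects where Beam executes the job.
        bash_command="python /opt/pipelines/beam_etl.py --runner=DirectRunner",
    )
```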

0 votes

Apache Airflow is a scheduling and monitoring tool. You need to write your ETL scripts (be it in Python or Scala) and run them using Apache Airflow.
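A minimal sketch of that pattern, with plain Python functions wired together as Airflow tasks; the function bodies and DAG id are hypothetical placeholders for your own logic.

```python
# Plain-Python ETL wired as Airflow tasks. Bodies are placeholders.
from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator

def extract():
    ...  # e.g. pull rows from a source database or API

def transform():
    ...  # e.g. clean and reshape the extracted data

def load():
    ...  # e.g. write the result into Redshift/Snowflake

with DAG(
    dag_id="simple_etl",                 # hypothetical DAG id
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    t1 = PythonOperator(task_id="extract", python_callable=extract)
    t2 = PythonOperator(task_id="transform", python_callable=transform)
    t3 = PythonOperator(task_id="load", python_callable=load)
    t1 >> t2 >> t3  # Airflow runs the tasks in this order
```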

Tools like Talend and Informatica provide rich UIs and a lot of built-in functionality, with which you can do everything from simple data dumps to highly complex transformations. Apart from that, scheduling, orchestration, etc. can be handled using their own scheduling functionality.

In case you are trying to build an enterprise-class data warehouse with a lot of complexity, I would suggest going ahead with an enterprise-class ETL tool. This will give you long-term benefits in terms of manageability, support, debugging, etc.