I'm trying to build an ETL (or ELT) pipeline with open-source frameworks. I have heard about two things, Apache Beam and Apache Airflow. Which one is best suited for the entire ETL/ELT process, the way tools like Talend or Azure Data Factory are? In fact, I'm trying to do everything with cloud data warehouses (Redshift, Azure SQL Data Warehouse, Snowflake, etc.). Which one is a good fit for this kind of work? It would also be great to get a comparison between the two frameworks. Thanks in advance.
2 Answers
Apache Airflow is not an ETL framework; it is an application for scheduling and monitoring workflows, which can schedule and monitor your ETL pipelines. Apache Beam is a unified programming model for defining data processing workflows.
That means your ETL pipelines will be written using Apache Beam, and Airflow will trigger and schedule these pipelines.
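To make the division of labor concrete, here is a minimal sketch of the Beam side: a small batch pipeline that sums an amount per region. This assumes `apache-beam` is installed; the file names and column layout are placeholders, not anything from the question.

```python
# Minimal Apache Beam pipeline sketch. File names ("sales.csv", "totals")
# and the "region,amount" row format are illustrative assumptions.

def parse_amount(line):
    # "region,amount" -> (region, amount as float)
    region, amount = line.split(",")
    return region, float(amount)

def run_pipeline():
    # Beam import kept inside the function so parse_amount stays
    # importable even where apache-beam is not installed.
    import apache_beam as beam

    with beam.Pipeline() as p:
        (
            p
            | "Read" >> beam.io.ReadFromText("sales.csv")       # placeholder input
            | "Parse" >> beam.Map(parse_amount)
            | "SumPerRegion" >> beam.CombinePerKey(sum)
            | "Format" >> beam.MapTuple(lambda k, v: f"{k},{v}")
            | "Write" >> beam.io.WriteToText("totals")          # placeholder output
        )

# To execute locally: run_pipeline()
```

Airflow's job is then only to call a pipeline like this on a schedule, not to do the data processing itself.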
Apache Airflow: is a scheduling and monitoring tool. You need to write your ETL scripts yourself (be it in Python or Scala) and run them using Apache Airflow.
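As a sketch of the Airflow side, a DAG file wraps your ETL script in tasks and gives it a schedule. This assumes Airflow 2.x; the DAG id, schedule, and the toy `transform` step are illustrative, and the load step is just a placeholder print.

```python
# Hypothetical DAG file (e.g. dags/etl_pipeline.py); assumes Airflow 2.x.
from datetime import datetime

def transform(records):
    # Pure transformation step: drop non-positive amounts, uppercase region.
    return [
        {**r, "region": r["region"].upper()}
        for r in records
        if r["amount"] > 0
    ]

def run_etl():
    # Extract -> transform -> load; the source and target are placeholders.
    records = [{"region": "eu", "amount": 10}, {"region": "us", "amount": -5}]
    cleaned = transform(records)
    print(f"loaded {len(cleaned)} rows")  # replace with a warehouse load

def build_dag():
    # Airflow imports kept local so the ETL functions above remain
    # importable and testable without Airflow installed.
    from airflow import DAG
    from airflow.operators.python import PythonOperator

    with DAG(
        dag_id="nightly_etl",            # hypothetical name
        start_date=datetime(2024, 1, 1),
        schedule="@daily",
        catchup=False,
    ) as dag:
        PythonOperator(task_id="run_etl", python_callable=run_etl)
    return dag

# In a real deployment this file lives in the dags/ folder and exposes the
# DAG at module level so the scheduler can discover it:
# dag = build_dag()
```

Note that Airflow only orchestrates: `run_etl` could just as well shell out to a Beam pipeline or any other script.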
Tools like Talend and Informatica provide rich UIs and built-in functionality covering everything from simple data dumps to highly complex transformations. Beyond that, scheduling, orchestration, etc. can be handled with their own scheduling functionality.
In case you are trying to build an enterprise-class data warehouse with a lot of complexity, I would suggest going ahead with an enterprise-class ETL tool. This will give you long-term benefits in terms of manageability, support, debugging, etc.