Why does AWS call Glue an ETL tool? We have to code everything ourselves to pull data; Glue provides no built-in functionality for it. Are there any benefits to using Glue instead of NiFi or another ingestion tool?
1 Answer
Glue is a good ETL tool within AWS, especially for big data workloads. After all, it runs on Spark.
Glue can generate some basic transformation code automatically, for example moving data from A to B and remapping column names.
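To give a rough idea of what that generated remapping looks like: Glue's `ApplyMapping` transform takes a list of `(source_name, source_type, target_name, target_type)` tuples. The sketch below builds such a list in plain Python. The helper `build_mappings`, the schema, and the column names are all made up for illustration; only the tuple shape and the `ApplyMapping.apply(frame=..., mappings=...)` call in the comment come from Glue itself.

```python
from typing import Dict, List, Tuple

# One ApplyMapping entry: (source_name, source_type, target_name, target_type)
Mapping = Tuple[str, str, str, str]


def build_mappings(schema: Dict[str, str], renames: Dict[str, str]) -> List[Mapping]:
    """Hypothetical helper: build ApplyMapping-style tuples from a schema.

    Columns listed in `renames` get a new name; all others keep theirs.
    Types are passed through unchanged in this sketch.
    """
    return [
        (column, col_type, renames.get(column, column), col_type)
        for column, col_type in schema.items()
    ]


# Illustrative source schema and renames (assumed, not from a real table).
schema = {"cust_id": "long", "fname": "string", "created": "timestamp"}
renames = {"cust_id": "customer_id", "fname": "first_name"}

mappings = build_mappings(schema, renames)

# Inside a Glue job the list would be handed to the ApplyMapping transform:
#   from awsglue.transforms import ApplyMapping
#   mapped = ApplyMapping.apply(frame=source_frame, mappings=mappings)
# where `source_frame` is a DynamicFrame read earlier in the job.
```

The `awsglue` module is only available inside the Glue runtime, which is why the transform call is left as a comment here.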
However, it's the flexibility to write custom code that really sets it apart. Using the Glue code editor or the PyCharm IDE, you can script any transformations you need in PySpark and/or Scala.
The benefits of Glue really show when it is used in conjunction with other AWS services. The Glue Data Catalog is shared with Athena and even Amazon EMR, so you end up with a central point for your big data ecosystem.
One limitation of Glue I have found is writing large datasets (10 million rows or more) to MS SQL Server. Glue uses JDBC drivers, and as of 2020, there is not yet a Microsoft JDBC connection that takes advantage of bulk copy. So you are effectively writing an insert statement for each row, and performance can suffer once you get into the tens of millions of rows.
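One thing that may be worth trying is converting the DynamicFrame to a Spark DataFrame and writing through Spark's own JDBC data source, where the `batchsize` option controls how many rows go into each JDBC batch. This does not turn on bulk copy, and whether it helps will depend on the workload. The sketch below only assembles the options; the helper `jdbc_write_options`, the host, the database, the table, and the credentials are placeholders I made up.

```python
from typing import Dict


def jdbc_write_options(
    host: str,
    port: int,
    database: str,
    table: str,
    user: str,
    password: str,
    batch_size: int = 10_000,
) -> Dict[str, str]:
    """Hypothetical helper: options for Spark's JDBC writer against SQL Server.

    `batchsize` is a standard Spark JDBC option (rows per batch).
    It groups the inserts; it does not enable bulk copy.
    """
    return {
        "url": f"jdbc:sqlserver://{host}:{port};databaseName={database}",
        "dbtable": table,
        "user": user,
        "password": password,
        "driver": "com.microsoft.sqlserver.jdbc.SQLServerDriver",
        "batchsize": str(batch_size),
    }


# Placeholder connection details, for illustration only.
opts = jdbc_write_options(
    host="db.example.internal",
    port=1433,
    database="sales",
    table="dbo.orders",
    user="etl_user",
    password="change-me",
)

# In a Glue job, with `dyf` being a DynamicFrame produced earlier:
#   df = dyf.toDF()
#   df.write.format("jdbc").options(**opts).mode("append").save()
```

In a real job the password would come from a Glue connection or a secrets store rather than being written into the script.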