I have read multiple posts on AWS Glue as an ETL service, but I couldn't get much out of them. Can someone explain in simple words how AWS Glue works? It creates an ENI, but what is it used for? I have read somewhere that an AWS Glue job runs inside an AWS Glue private subnet; is that true? Can you explain with an architecture diagram? Also, why do we need to provide a VPC when creating Glue connections?
3
votes
AWS re:Invent 2016: NEW LAUNCH! Introduction to AWS Glue: A Fully Managed ETL Service (BDA209) - YouTube
– John Rotenstein
Did you read this docs.aws.amazon.com/glue/latest/dg/how-it-works.html ?
– Andrzej Sydor
For Glue to ETL your data, Glue needs access to your data. If that data is in a data store (e.g. a MySQL DB) inside a private subnet of your VPC, then Glue needs to drop an ENI into that subnet; otherwise it cannot reach the (private) data source. If the data is available via a public endpoint (e.g. in S3 or DynamoDB), then there's no need for Glue to run in your VPC.
– jarmod
1 Answer
1
votes
To keep the concept as simple as possible, think of AWS Glue as managed Spark: you write a Python or Scala script that performs a specific data processing task, and Glue executes it as a job. For example, a Python script can use GlueContext to read a CSV file from an S3 bucket and store it back as JSON.
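As a rough illustration of the CSV-to-JSON example, here is a minimal sketch of what such a job script might look like. The function name `csv_to_json` and the bucket paths are made up for illustration; `create_dynamic_frame.from_options` and `write_dynamic_frame.from_options` are the GlueContext calls I would expect to use, but treat the exact options as assumptions to check against your Glue version.

```python
def csv_to_json(glue_context, source_path, target_path):
    """Read CSV files under source_path and write them to target_path as JSON.

    `glue_context` is expected to be an awsglue.context.GlueContext.
    """
    # Read the CSV data into a DynamicFrame; "withHeader" treats the
    # first row of each file as column names.
    frame = glue_context.create_dynamic_frame.from_options(
        connection_type="s3",
        connection_options={"paths": [source_path]},
        format="csv",
        format_options={"withHeader": True},
    )
    # Write the same records back to S3 as JSON.
    glue_context.write_dynamic_frame.from_options(
        frame=frame,
        connection_type="s3",
        connection_options={"path": target_path},
        format="json",
    )
    return frame


# Inside a Glue job the context is typically created like this
# (awsglue and pyspark are provided by the Glue runtime, not by pip):
#
#   from pyspark.context import SparkContext
#   from awsglue.context import GlueContext
#
#   glue_context = GlueContext(SparkContext.getOrCreate())
#   csv_to_json(
#       glue_context,
#       "s3://example-input-bucket/csv/",    # hypothetical path
#       "s3://example-output-bucket/json/",  # hypothetical path
#   )
```

Both the source and the target here are S3, which is reachable through a public endpoint, so a job like this would not need any VPC settings.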
In regards to networking and VPC, you can establish a private connection between your VPC and AWS Glue. You can use this connection to enable AWS Glue to communicate with the resources in your VPC without going through the public internet. With a VPC, you have control over your network settings, such as the IP address range, subnets, route tables, and network gateways.
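To make the "why a VPC for connections" part more concrete, here is a sketch of the payload a JDBC connection definition might carry when created through boto3's `create_connection`. The helper `build_jdbc_connection_input` and every ID, URL, and credential below are placeholders I made up. The `PhysicalConnectionRequirements` block (subnet, security groups, availability zone) is the part that tells Glue where to place its ENI so the job can reach a private data store.

```python
def build_jdbc_connection_input(name, jdbc_url, username, password,
                                subnet_id, security_group_ids,
                                availability_zone):
    """Build the ConnectionInput dict for a JDBC Glue connection (hypothetical helper)."""
    return {
        "Name": name,
        "ConnectionType": "JDBC",
        "ConnectionProperties": {
            "JDBC_CONNECTION_URL": jdbc_url,
            "USERNAME": username,
            "PASSWORD": password,
        },
        # Network placement: Glue attaches an ENI in this subnet,
        # guarded by these security groups.
        "PhysicalConnectionRequirements": {
            "SubnetId": subnet_id,
            "SecurityGroupIdList": list(security_group_ids),
            "AvailabilityZone": availability_zone,
        },
    }


# All values below are placeholders, not real resources.
connection_input = build_jdbc_connection_input(
    name="example-mysql-connection",
    jdbc_url="jdbc:mysql://example-db.internal:3306/exampledb",
    username="example_user",
    password="example_password",
    subnet_id="subnet-00000000",
    security_group_ids=["sg-00000000"],
    availability_zone="us-east-1a",
)

# With boto3 installed and AWS credentials configured, the connection
# could then be registered with:
#
#   import boto3
#   boto3.client("glue").create_connection(ConnectionInput=connection_input)
```

In practice you would not hard-code the password; it is inlined here only to keep the sketch short.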
AWS Glue Concepts: