
I'm new to using AWS Glue and I don't understand how the ETL job gathers the data. I used a crawler to generate my table schema from some files in an S3 bucket and examined the autogenerated script in the ETL job, which is here (slightly modified):

import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

## @params: [JOB_NAME]
args = getResolvedOptions(sys.argv, ['JOB_NAME'])

sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)
datasource0 = glueContext.create_dynamic_frame.from_catalog(database = "mydatabase", table_name = "mytablename", transformation_ctx = "datasource0")
applymapping1 = ApplyMapping.apply(frame = datasource0, mappings = [("data", "string", "data", "string")], transformation_ctx = "applymapping1")
datasink2 = glueContext.write_dynamic_frame.from_options(frame = applymapping1, connection_type = "s3", connection_options = {"path": "s3://myoutputbucket"}, format = "json", transformation_ctx = "datasink2")

When I run this job, it successfully takes my data from the bucket that my crawler used to generate the table schema and puts it into my destination S3 bucket as expected.

My question is this: I don't see anywhere in this script where the data is "loaded", so to speak. I know I point it at the table that was generated by the crawler, but from this doc:

Tables and databases in AWS Glue are objects in the AWS Glue Data Catalog. They contain metadata; they don't contain data from a data store.

If the table only contains metadata, how are the files from the data store (in my case, an S3 bucket) retrieved by the ETL job? I'm asking primarily because I'd like to somehow modify the ETL job to transform identically structured files in a different bucket without having to write a new crawler, but also because I'd like to strengthen my general understanding of the Glue service.


2 Answers

2 votes

The main thing to understand is that the Glue Data Catalog (databases and tables) is always in sync with Athena, which is a serverless query service that makes it easy to analyze data in Amazon S3 using standard SQL. You can create tables and databases from either the Glue console or the Athena query console.

datasource0 = glueContext.create_dynamic_frame.from_catalog(database = "mydatabase", table_name = "mytablename", transformation_ctx = "datasource0")

The line above is what does the magic for you: it creates the initial DynamicFrame from the Glue Data Catalog source table. Besides the metadata, schema, and table properties, the catalog table also holds the Location, which points to your data store (the S3 path) where your data actually resides.
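Because the catalog only supplies the schema and the Location, you do not strictly need a new crawler to process identically structured files in a different bucket: you can read straight from an S3 path with create_dynamic_frame.from_options instead of from_catalog. A minimal sketch, where the bucket path is a placeholder:

# Read JSON files directly from an S3 path, bypassing the Data Catalog.
# "s3://my-other-bucket/prefix/" is a placeholder for the second bucket.
datasource_direct = glueContext.create_dynamic_frame.from_options(
    connection_type = "s3",
    connection_options = {"paths": ["s3://my-other-bucket/prefix/"]},
    format = "json",
    transformation_ctx = "datasource_direct")

# The same ApplyMapping works because the files share the same structure.
applymapping_direct = ApplyMapping.apply(
    frame = datasource_direct,
    mappings = [("data", "string", "data", "string")],
    transformation_ctx = "applymapping_direct")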


After ApplyMapping has been done, this portion of the code (the datasink) does the actual loading of the data into your target, whether that is S3 (as here), a cluster, or a database.

datasink2 = glueContext.write_dynamic_frame.from_options(frame = applymapping1, connection_type = "s3", connection_options = {"path": "s3://myoutputbucket"}, format = "json", transformation_ctx = "datasink2")
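If the target were a database or cluster rather than S3, only the sink call would change. A minimal sketch, assuming a pre-created Glue connection named "my-jdbc-connection" and a hypothetical target table:

# Hypothetical JDBC target; "my-jdbc-connection" is a Glue connection you would set up beforehand.
datasink_jdbc = glueContext.write_dynamic_frame.from_jdbc_conf(
    frame = applymapping1,
    catalog_connection = "my-jdbc-connection",
    connection_options = {"dbtable": "mytargettable", "database": "mytargetdb"},
    transformation_ctx = "datasink_jdbc")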
0 votes

If you drill down into the AWS Glue Data Catalog, you will find tables residing under databases. Clicking on one of these tables exposes its metadata, which shows the S3 folder the table points to as a result of the crawler run.
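You can also read that same metadata programmatically instead of through the console. A small sketch with boto3, reusing the database and table names from the question:

import boto3

glue = boto3.client("glue")

# Fetch the catalog entry that the crawler created.
table = glue.get_table(DatabaseName = "mydatabase", Name = "mytablename")

# The StorageDescriptor holds the columns and, crucially, the S3 location the table points to.
print(table["Table"]["StorageDescriptor"]["Location"])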

You can still create a table over a structured file in S3 manually, by adding a table via the Data Catalog option and pointing it to your S3 location.
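The same manual step can be scripted. A minimal sketch with boto3, assuming JSON files with a single string column; all names, paths, and the SerDe choice are placeholders:

import boto3

glue = boto3.client("glue")

# Register a table in the Data Catalog by hand, pointing at an S3 prefix.
# Database, table, path, and the JSON SerDe are assumptions to adjust.
glue.create_table(
    DatabaseName = "mydatabase",
    TableInput = {
        "Name": "mytablename2",
        "TableType": "EXTERNAL_TABLE",
        "StorageDescriptor": {
            "Columns": [{"Name": "data", "Type": "string"}],
            "Location": "s3://my-other-bucket/prefix/",
            "InputFormat": "org.apache.hadoop.mapred.TextInputFormat",
            "OutputFormat": "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat",
            "SerdeInfo": {"SerializationLibrary": "org.openx.data.jsonserde.JsonSerDe"},
        },
    })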

Another way is to use the Athena console to create a table pointing to an S3 location. You would use a regular CREATE TABLE statement with the LOCATION clause holding your S3 path.
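For example, a sketch of that approach submitted through boto3 rather than the console; the database, table, bucket names, and the JSON SerDe are placeholders:

import boto3

athena = boto3.client("athena")

# A regular CREATE EXTERNAL TABLE statement; LOCATION holds the S3 path.
ddl = """
CREATE EXTERNAL TABLE IF NOT EXISTS mydatabase.mytablename2 (
    data string
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
LOCATION 's3://my-other-bucket/prefix/'
"""

athena.start_query_execution(
    QueryString = ddl,
    QueryExecutionContext = {"Database": "mydatabase"},
    ResultConfiguration = {"OutputLocation": "s3://my-athena-query-results-bucket/"})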