
I'm new to using AWS Glue and I don't understand how the ETL job gathers the data. I used a crawler to generate my table schema from some files in an S3 bucket and examined the autogenerated script in the ETL job, which is here (slightly modified):

import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

## @params: [JOB_NAME]
args = getResolvedOptions(sys.argv, ['JOB_NAME'])

sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)
datasource0 = glueContext.create_dynamic_frame.from_catalog(database = "mydatabase", table_name = "mytablename", transformation_ctx = "datasource0")
applymapping1 = ApplyMapping.apply(frame = datasource0, mappings = [("data", "string", "data", "string")], transformation_ctx = "applymapping1")
datasink2 = glueContext.write_dynamic_frame.from_options(frame = applymapping1, connection_type = "s3", connection_options = {"path": "s3://myoutputbucket"}, format = "json", transformation_ctx = "datasink2")

When I run this job, it successfully takes my data from the bucket that my crawler used to generate the table schema and puts it into my destination S3 bucket as expected.

My question is this: I don't see anywhere in this script where the data is "loaded", so to speak. I know I point it at the table that was generated by the crawler, but from this doc:

Tables and databases in AWS Glue are objects in the AWS Glue Data Catalog. They contain metadata; they don't contain data from a data store.

If the table only contains metadata, how are the files from the data store (in my case, an S3 bucket) retrieved by the ETL job? I'm asking primarily because I'd like to somehow modify the ETL job to transform identically structured files in a different bucket without having to write a new crawler, but also because I'd like to strengthen my general understanding of the Glue service.


2 Answers

2 votes

The main thing to understand is that the Glue Data Catalog (databases and tables) is always in sync with Athena, which is a serverless query service that makes it easy to analyze data in Amazon S3 using standard SQL. You can create tables and databases from either the Glue console or the Athena query console.

datasource0 = glueContext.create_dynamic_frame.from_catalog(database = "mydatabase", table_name = "mytablename", transformation_ctx = "datasource0")

The line above is what does the magic for you: it creates the initial DynamicFrame from the Glue Data Catalog source table. Besides the metadata, schema, and table properties, the catalog table also holds the Location, which points to your data store (the S3 path) where your data actually resides.
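Because the catalog only supplies the schema and the Location, you do not strictly need a new crawler to process identically structured files in a different bucket: you can read straight from an S3 path with create_dynamic_frame.from_options instead of from_catalog. A minimal sketch, where the bucket path is a placeholder:

# Read JSON files directly from an S3 path, bypassing the Data Catalog.
# "s3://my-other-bucket/prefix/" is a placeholder for the second bucket.
datasource_direct = glueContext.create_dynamic_frame.from_options(
    connection_type = "s3",
    connection_options = {"paths": ["s3://my-other-bucket/prefix/"]},
    format = "json",
    transformation_ctx = "datasource_direct")

# The same ApplyMapping works because the files share the same structure.
applymapping_direct = ApplyMapping.apply(
    frame = datasource_direct,
    mappings = [("data", "string", "data", "string")],
    transformation_ctx = "applymapping_direct")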


After ApplyMapping has been done, this portion of the code (the datasink) does the actual loading of the data into your target, whether that is S3 (as here), a cluster, or a database.

datasink2 = glueContext.write_dynamic_frame.from_options(frame = applymapping1, connection_type = "s3", connection_options = {"path": "s3://myoutputbucket"}, format = "json", transformation_ctx = "datasink2")
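If the target were a database or cluster rather than S3, only the sink call would change. A minimal sketch, assuming a pre-created Glue connection named "my-jdbc-connection" and a hypothetical target table:

# Hypothetical JDBC target; "my-jdbc-connection" is a Glue connection you would set up beforehand.
datasink_jdbc = glueContext.write_dynamic_frame.from_jdbc_conf(
    frame = applymapping1,
    catalog_connection = "my-jdbc-connection",
    connection_options = {"dbtable": "mytargettable", "database": "mytargetdb"},
    transformation_ctx = "datasink_jdbc")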
0 votes

If you drill down into the AWS Glue Data Catalog, you will find tables residing under databases. Clicking on one of these tables exposes its metadata, which shows the S3 folder the table points to as a result of the crawler run.
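You can also read that same metadata programmatically instead of through the console. A small sketch with boto3, reusing the database and table names from the question:

import boto3

glue = boto3.client("glue")

# Fetch the catalog entry that the crawler created.
table = glue.get_table(DatabaseName = "mydatabase", Name = "mytablename")

# The StorageDescriptor holds the columns and, crucially, the S3 location the table points to.
print(table["Table"]["StorageDescriptor"]["Location"])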

You can still create a table over a structured file in S3 manually, by adding a table via the Data Catalog option and pointing it to your S3 location.
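The same manual step can be scripted. A minimal sketch with boto3, assuming JSON files with a single string column; all names, paths, and the SerDe choice are placeholders:

import boto3

glue = boto3.client("glue")

# Register a table in the Data Catalog by hand, pointing at an S3 prefix.
# Database, table, path, and the JSON SerDe are assumptions to adjust.
glue.create_table(
    DatabaseName = "mydatabase",
    TableInput = {
        "Name": "mytablename2",
        "TableType": "EXTERNAL_TABLE",
        "StorageDescriptor": {
            "Columns": [{"Name": "data", "Type": "string"}],
            "Location": "s3://my-other-bucket/prefix/",
            "InputFormat": "org.apache.hadoop.mapred.TextInputFormat",
            "OutputFormat": "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat",
            "SerdeInfo": {"SerializationLibrary": "org.openx.data.jsonserde.JsonSerDe"},
        },
    })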

Another way is to use the Athena console to create a table pointing to an S3 location. You would use a regular CREATE TABLE statement with the LOCATION clause holding your S3 path.
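For example, a sketch of that approach submitted through boto3 rather than the console; the database, table, bucket names, and the JSON SerDe are placeholders:

import boto3

athena = boto3.client("athena")

# A regular CREATE EXTERNAL TABLE statement; LOCATION holds the S3 path.
ddl = """
CREATE EXTERNAL TABLE IF NOT EXISTS mydatabase.mytablename2 (
    data string
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
LOCATION 's3://my-other-bucket/prefix/'
"""

athena.start_query_execution(
    QueryString = ddl,
    QueryExecutionContext = {"Database": "mydatabase"},
    ResultConfiguration = {"OutputLocation": "s3://my-athena-query-results-bucket/"})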