I want to validate the schema before ETL processing in AWS Glue. I am trying to do everything within Glue to avoid using Airflow or other tools.
The flow is: raw data in S3 -> crawl the S3 data with a Glue crawler -> perform a schema check -> basic ETL in AWS Glue (a plain select * for now) -> output to S3 -> run ad hoc queries as a check before further processing with ETL software installed on EC2. The idea is that if any step fails, I want to send out a notification (email or otherwise) saying what failed and where.
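For the notification piece, my rough idea is to wrap each step in a try/except and publish to an SNS topic with an email subscription. This is only a sketch; the topic ARN, step name, and run_etl_step are placeholders for my actual setup:

import boto3

# Sketch only: send a failure notification through SNS (topic ARN is a placeholder)
sns = boto3.client("sns")
SNS_TOPIC_ARN = "arn:aws:sns:us-east-1:123456789012:etl-failures"  # placeholder

def notify_failure(step_name, error):
    # Publish which step failed and why; an email subscription on the topic delivers it
    sns.publish(
        TopicArn=SNS_TOPIC_ARN,
        Subject="ETL pipeline failure in step: " + step_name,
        Message="Step '" + step_name + "' failed with error:\n" + str(error),
    )

def run_etl_step():
    # Placeholder for the actual Glue transform (e.g. the ApplyMapping step below)
    pass

try:
    run_etl_step()
except Exception as e:
    notify_failure("glue-etl", e)
    raise  # re-raise so the Glue job run is still marked as failed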
Sample data file: the first table (OrderDate, region,...) from this link
Option 1: The AWS Glue ETL script maps the fields within the script itself. If a field has an invalid type (e.g., an int in the date column), will the script 'fail' and stop? I have not seen a way within the script to validate only the schema before processing starts.
Sample script line in PySpark:
applymapping1 = ApplyMapping.apply(
    frame = datasource0,
    mappings = [
        ("orderdate", "string", "orderdate", "date"),
        ("region", "string", "region", "string"),
        ("rep", "string", "rep", "string"),
        ("item", "string", "item", "string"),
        ("units", "long", "units", "int"),
        ("unitcost", "double", "unitcost", "double"),
        ("total", "double", "total", "double"),
    ],
    transformation_ctx = "applymapping1",
)
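What I have not figured out for Option 1 is a schema-only check before this mapping runs. My current thought is to read the crawled table definition from the Data Catalog at the top of the script and abort if it does not match what I expect. A rough sketch; the database name, table name, and expected types are placeholders based on what my crawler records:

import sys
import boto3

# Sketch: compare the crawler's detected column types against the expected ones
# before any transform runs. Database/table names and types are placeholders.
EXPECTED_SCHEMA = {
    "orderdate": "string",
    "region": "string",
    "rep": "string",
    "item": "string",
    "units": "bigint",
    "unitcost": "double",
    "total": "double",
}

glue_client = boto3.client("glue")
table = glue_client.get_table(DatabaseName="raw_db", Name="orders")
actual_schema = {
    col["Name"].lower(): col["Type"]
    for col in table["Table"]["StorageDescriptor"]["Columns"]
}

if actual_schema != EXPECTED_SCHEMA:
    # this is where notify_failure() from the earlier sketch would be called
    sys.exit("Schema check failed - aborting before ApplyMapping")

That would catch type drift in the catalog, but as far as I can tell it would not catch a bad value inside a column the crawler still typed as string, which is part of why I am asking how ApplyMapping behaves.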
Option 2: I have been reading up on Glue classifiers and have built a test CSV classifier. I am unsure how to apply it to my actual crawled data, though, as I do not see an option that links the two together. If the classifier fails, will the ETL script still run?
Sample classifier: (screenshot of the test CSV classifier)
Workflows require triggers, and triggers require ETL scripts, so I'm unsure how to add the classifier there. I am assuming the classifier is applied while the crawl is happening, but it is unclear how to wire that up.
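My best guess is that the classifier gets attached to the crawler itself rather than anywhere in the workflow, along these lines (the crawler name, IAM role, database, S3 path, and classifier name are all placeholders):

import boto3

glue_client = boto3.client("glue")

# Sketch: attach my custom CSV classifier to the crawler by name.
# All names/paths below are placeholders for my actual setup.
glue_client.create_crawler(
    Name="raw-orders-crawler",
    Role="AWSGlueServiceRole-etl",
    DatabaseName="raw_db",
    Targets={"S3Targets": [{"Path": "s3://my-bucket/raw/orders/"}]},
    Classifiers=["my-csv-classifier"],
)

Even if that is the right way to attach it, my question still stands: if the classifier does not match the data, does the crawler (and then the ETL job) carry on anyway, and where would I hook in the failure notification?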