0
votes

My current use case is an ETL-based service (note: it does not use Glue ETL; it is an independent service) that extracts data from AWS Redshift clusters into S3. The data in S3 is then fed into the transform and load (T and L) jobs. I want to populate the metadata into the Glue Data Catalog. The most basic solution for this is to use a Glue Crawler, but the crawler runs for approximately 1 hour and 20 minutes (there are a lot of S3 partitions). The other solution I came across is to use the Glue APIs; however, I am running into the issue of defining the data types there.

Is there any way I can create/update Glue Catalog tables when the data is in S3 and the data types are only known during the extraction process?

Also, the data types should be readily available in the catalog by the time the T and L jobs run.

How many tables are present in your use case? – Prabhakar Reddy
There will be many tables; the Glue Catalog will serve only as a catalog for the data being populated into S3. There will be around 100 jobs running, generating 500-600 GB of data in a single day. – Priyank Verma
Got some leads: Redshift's CREATE EXTERNAL TABLE AS will be useful when creating a new table, and it loads the data into S3 as well. When the table is updated, e.g. new partitions are added, I will use Athena's MSCK REPAIR TABLE to register the partitions in the table. However, I am still looking for a more optimised solution, as MSCK does not scale well and its run time increases with the size of the dataset. – Priyank Verma
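
For reference, a minimal sketch of the partition-refresh step described in that comment, running MSCK REPAIR TABLE through Athena with boto3; the database, table, and results bucket names are hypothetical placeholders:

import boto3

# Ask Athena to scan the table's S3 location and register any new
# partitions in the Glue Data Catalog.
athena = boto3.client("athena")

response = athena.start_query_execution(
    QueryString="MSCK REPAIR TABLE my_table",
    QueryExecutionContext={"Database": "my_database"},
    ResultConfiguration={"OutputLocation": "s3://my-athena-query-results/"},
)
print(response["QueryExecutionId"])

As the comment notes, this rescans the whole table location, so its run time grows with the number of partitions.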

2 Answers

0
votes

In order to create or update the Data Catalog during your ETL process, you can make use of the following:

Update:

additionalOptions = {"enableUpdateCatalog": True, "updateBehavior": "UPDATE_IN_DATABASE"}
additionalOptions["partitionKeys"] = ["partition_key0", "partition_key1"]

sink = glueContext.write_dynamic_frame_from_catalog(frame=last_transform, database=<dst_db_name>,
                                                    table_name=<dst_tbl_name>, transformation_ctx="write_sink",
                                                    additional_options=additionalOptions)
job.commit()

The above can be used to update the schema. You also have the option to set updateBehavior, choosing between LOG and UPDATE_IN_DATABASE (the default).

Create:

To create new tables in the Data Catalog during your ETL, you can follow this example:

sink = glueContext.getSink(connection_type="s3", path="s3://path/to/data",
                           enableUpdateCatalog=True, updateBehavior="UPDATE_IN_DATABASE",
                           partitionKeys=["partition_key0", "partition_key1"])
sink.setFormat("<format>")
sink.setCatalogInfo(catalogDatabase=<dst_db_name>, catalogTableName=<dst_tbl_name>)
sink.writeFrame(last_transform)

You can specify the database and new table name using setCatalogInfo.

You also have the option to update the partitions in the Data Catalog by using the enableUpdateCatalog argument and then specifying the partitionKeys.

A more detailed explanation of this functionality can be found here.

0
votes

I found a solution to the problem: I ended up using the Glue Catalog APIs, which makes it seamless and fast. I created an interface that interacts with the Glue Catalog and overrode its methods for the various data sources. Right after the data has been loaded into S3, I fire a query to get the schema from the source, and then the interface does its work.
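
A minimal sketch of what one of those interface methods might look like with boto3, assuming Parquet output; the database, table, path, and columns are hypothetical examples, not the actual implementation:

import boto3

glue = boto3.client("glue")

# Register a table whose column types were discovered during extraction.
columns = [
    {"Name": "id", "Type": "bigint"},
    {"Name": "created_at", "Type": "timestamp"},
]

glue.create_table(
    DatabaseName="etl_catalog",
    TableInput={
        "Name": "orders",
        "TableType": "EXTERNAL_TABLE",
        "PartitionKeys": [{"Name": "dt", "Type": "string"}],
        "StorageDescriptor": {
            "Columns": columns,
            "Location": "s3://my-bucket/orders/",
            "InputFormat": "org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat",
            "OutputFormat": "org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat",
            "SerdeInfo": {
                "SerializationLibrary": "org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe"
            },
        },
    },
)

For subsequent loads, update_table and batch_create_partition can be called the same way, which avoids re-running MSCK REPAIR TABLE as the number of partitions grows.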